What is Ethnic Density?
Ethnic density is defined as the composition of each ethnic group residing in a geographical area of a given size (usually a census area known as a Lower Super Output Area (LSOA), which contains around 1500 residents).
Here’s an image of London showing the ethnic density or ethnic composition of the city:
Ethnic Density of London
In the image above, the White British ethnic density is very high (indicated by the dominant dark green colour) across the city but especially in the outskirts of central London. There are pockets of high Asian ethnic density (dark blue), in the East and West of London. Non-British White ethnic groups (yellow) and Black ethnic groups (pink) have a reasonably high ethnic density in and around central London.
The ethnic densities vary across London, and different ethnic groups are dominant across different parts of London.
Measuring ethnic density using the ethnic density score
Based on where they live and their ethnicity, every individual can be assigned an ethnic density score. This score is simply the ethnic density of the individual's own ethnic group (own-group ethnic density) in the area they live in.
Calculating ethnic density score:
A residential area - "Area A" - has a total of 1500 residents.
The ethnic composition in Area A is:
500 residents of Indian descent,
250 residents of British descent and
750 residents of African descent.
The Indian Ethnic Density would be 500 divided by the total number of
residents, 1500 (0.33).
That is, 33% of all individuals in this area are of Indian descent.
The British Ethnic Density would be 250 / 1500 (0.167).
The African Ethnic Density would be 750 / 1500 (0.50).
For someone of Chinese descent who moves into Area A, their ethnic density
score would be about 1 divided by 1500 = ~0.00, which indicates they live in
an area of low own-group ethnic density.
Someone of African descent would have an ethnic density score of 50%. The
African person is living in an area of __high ethnic density__ (because there
are more of this ethnic group in Area A, relative to any other ethnic group).
A person's ethnic density score therefore indicates the type of area they live in, in terms of its ethnic composition: either a high ethnic density area, where there are more individuals of their own ethnicity, or a low ethnic density area, where most residents are of another ethnicity.
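The worked example above can be reproduced in a few lines of R. This is a minimal sketch using the Area A counts from the text; the object names (`area_a`, `density`, `chinese_score`) are illustrative and are not part of the project code.

```r
# Resident counts for the hypothetical "Area A" described above
area_a <- c(Indian = 500, British = 250, African = 750)

# Ethnic density = residents in the group / all residents in the area
density <- area_a / sum(area_a)
round(density, 3)

# Own-group score for a single new resident of Chinese descent.
# The newcomer raises the total to 1501, so the score is still roughly 1 in 1500.
chinese_score <- 1 / (sum(area_a) + 1)
round(chinese_score, 4)
```

Each resident's own-group score is simply the entry of `density` that matches their ethnicity.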
Why is Ethnic Density important?
Some studies report that, in multicultural cities, ethnic minorities living in areas with higher proportions of ethnic minority residents may be better off (but in some cases worse off) in terms of their mental and physical health, relative to ethnic minority groups living in areas with larger proportions of the host ethnicity. This beneficial effect on health, by virtue of the ethnic composition of the residential area, is known as the ethnic density effect.
The figure below is an example suggesting that reported levels of psychotic symptoms on relevant measures decrease among individuals living in areas of higher own-group ethnic density.
The Ethnic Density effect and Levels of Reporting Psychotic symptoms among White British women, Du Preez et al, 2016
Another example (figure below) shows the ethnic density effect in play in a majority of the ethnicities presented. There seems to be a reverse effect in the White British ethnic group.
Differing effects of Ethnic Density in Different Ethnicities, Das Munshi et al, 2012
As can be seen, the ethnic density effect may not manifest consistently among all ethnic groups, but there is evidence of a protective effect against adverse mental health outcomes.
Studies demonstrating the positive Ethnic Density Effect on Mental Health
This ethnic density “effect” was first reported in a 1939 study by Faris and Dunham. Their study, based in Chicago, showed that White people living in areas where Black ethnic groups were predominant had a higher rate of schizophrenia (137.4 cases per 100,000) compared to the Black residents (39.4 cases per 100,000), where the overall area prevalence rate for schizophrenia was 50.4 cases per 100,000. In another study, Halpern and Nazroo used a nationwide community survey in England and Wales to explore the association between ethnic density and reported levels of psychiatric symptoms. They showed a negative correlation between own-group ethnic density and neurotic symptoms, such as fatigue, sleep problems, depression and anxiety (r = -0.087). That is, with an increase in ethnic density, there is a decrease in the levels of neurotic symptoms. Similarly, they found a negative association between ethnic density and psychotic symptoms (r = -0.113).
Studies demonstrating a complex mechanism of the effect
Studies of the effects of ethnic density (own-group or combined ethnic minority) on physical and mental health have demonstrated positive effects on health outcomes, but also detrimental effects in some ethnicities and not others. For example, among Black groups the association is largely reversed, with the risk of premature and all-cause mortality increasing as Black ethnic density increases. The mechanism of the ethnic density effect is complex and requires a deeper understanding of ethnic groups and cultures.
Ethnic Density Effect and Suicidality
There is some evidence of the ethnic density effect being protective, for ethnic minority groups in the community, against suicide-related behaviours. In 2012, a review was published summarising the effect of ethnic density on mental health outcomes, which included suicide-related behaviours (2 studies) [Shaw et al, 2012]. Both studies found reduced risk of self-harm behaviour and completed suicide among ethnic minority groups with increasing ethnic density.
In one study, the rates of A&E attendance for self-harm were compared among White, African-Caribbean and Asian groups. They found that, as the ethnic minority densities increased, the self-harm referral rates of ethnic minorities fell relative to White self-harm referral rate with a risk ratio (RR) of 1.24 (95% CI: 0.69 – 2.10) in lower ethnic minority density versus an RR of 0.61 (0.47 – 0.79) in higher ethnic minority density areas.
In the other study, Neeleman used coroner’s records of completed suicides to determine subjects’ ethnic background and to generate White and non-White ethnic density for each subject. They found that, as ethnic minority density increased, suicide rates were higher among the White ethnic group, with an RR of 1.18 (1.02 – 1.37), and lower among ethnic minority groups, with an RR of 0.75 (0.59 – 0.96).
Individuals diagnosed with certain mental disorders seen in secondary mental health care have a particularly high risk of suicide mortality compared to the general population. Whether the ethnic density effect has any impact on this risk is not clear.
The aim is to determine whether there is an association between ethnic density and completed suicide in this secondary mental health care setting. In other words, this project studies whether living in an area of high or low ethnic density (i.e. surrounded more by people of the same ethnicity or not) has any effect on completed suicide among people with mental illness.
The data are derived from a mental health clinical trust in South London, which provides mental healthcare for an area with a population of around 1.4 million residents. It treats individuals with mild to severe mental health problems who are referred by GPs, privately referred, referred from A&E or self-referred.
The trust uses an electronic system to record day-to-day patient interactions (medical, demographic, clinical intervention etc.) in either structured notes or free-text fields.
In 2008, a research facility was founded which uses a pseudonymised version of this electronic system from the South London trust for research and clinical audit purposes. Currently there are ~270000 records in this research database. For this project, a subset of patients, and related variables, was extracted based on inclusion criteria (see below) to create the dataset.
The dataset consists of 47851 patients.
The patients in the dataset were included if they met the following inclusion criteria:
They had an active referral (in the form of face to face contact) at any point between the observation window of 1st of January 2008 and 31st of December 2014.
They had a clinical diagnosis of depression, schizophrenia, schizoaffective disorder, bipolar disorder, manic disorder or alcohol abuse. For patients with multiple diagnoses, the date of diagnosis closest to the observation start date was selected.
They had an area-level address (LSOA code) recorded (to merge with census data). For patients with multiple area-level addresses, the closest address to the date of diagnosis was selected.
They had a known ethnicity recorded (each patient’s ethnicity and the ethnic composition of their LSOA were used to assign an ethnic density score).
Each individual in the cohort is diagnosed with one or more of the disorders mentioned in the table below.
| Diagnosis | N | Number of Suicides |
|---|---|---|
| Schizophrenia | ||
| No | 38091 | 190 |
| Yes | 9438 | 72 |
| Schizoaffective | ||
| No | 46199 | 252 |
| Yes | 1330 | 10 |
| Bipolar | ||
| No | 42197 | 216 |
| Yes | 5332 | 46 |
| Substance Abuse | ||
| No | 30030 | 169 |
| Yes | 17499 | 93 |
| Depressive | ||
| No | 27023 | 152 |
| Yes | 20506 | 110 |
| Manic Disorder | ||
| No | 45487 | 251 |
| Yes | 2042 | 11 |
Data notes
The original data is loaded in R and named ed. What follows is the process by which ed is cleaned. After cleaning, the dataset is renamed edclean.
edclean is the dataset to load and use for feature engineering, Exploratory Data Analysis (EDA) and final analysis.
Structure of the ed dataset
On first glance the str(ed) output (results not shown) indicates:
Missing value map on the ed dataset
The figure above displays all the variables in the “ed” dataset for each row. The missing values are colour coded (red for values, non-red for missing values).
There are columns with no values (columns D and C) and redundant values (“JunkID”).
Diagnosis and diagnosis date columns contain more than 50% NA values. The NA values indicate patients who are not diagnosed with the particular disorder. NA will be replaced with 0 in the diagnosis flag columns.
Some variables have names that are not decipherable and need renaming (e.g. “AL” = OtherBlack_EDPercent). ID mappings are available from the original Stata database (not shown).
The exposure variable “ethnic density scores” will need to be generated using the available columns in the dataset.
Details of how each variable needs to be cleaned, after inspecting it with the table() and table(is.na()) functions and looking at the missing map:
| Variable | Notes |
|---|---|
| JunkID | can be deleted, redundant |
| Gender_Cleaned | There are 2 “empty” Gender cells. Will replace them with NA |
| DOB_Cleaned | one NA value in row 45501, not sure what to do with it, will leave until further analysis |
| Marital_Cleaned | 3803 blank values, no NA values. The 3803 blank values are recoded as “Unknown” |
| primary_diagnosis | can be deleted, redundant. Lots of different diagnoses in unstructured and structured format; potentially redundant because other variables flag the main diagnoses. |
| ethnicitycleaned | is fine, no NA or blank values |
| imd_score | 49 NA values, not sure what to do here. Will leave it in until further analysis. |
| diagnosis_date to Bipolar_Diag_Date | these variables are dates of the main diagnoses and binary flag variables for diagnoses, NA in these variables means patient does not have the disorder. So perhaps better to change it to 0 instead. |
| ons_date_of_death | 5310 deaths in the cohort |
| Suicide | 263 suicides in the cohort |
| ICD10_UnderlyingCause | redundant variable, delete |
| LSOAClosestToDiagnosis | LSOA area level address code |
| LSOA11 | redundant, delete |
| lsoa01 | redundant, delete |
| LSOA_NAME | name of boroughs |
| C and D | are empty, delete |
| All_Usual_Residents to AP | variable names are not informative and need to be renamed. These variables are ethnic densities (number and percentage of people of different ethnic groups in the corresponding LSOA code) and will be needed to generate the main exposure variable “ethnicdensityscore” |
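The blank and NA counts noted in the table above can be gathered for every column in one pass rather than with repeated table() calls. The sketch below assumes a toy data frame standing in for `ed` (the values are made up), and the name `audit` is illustrative only.

```r
# Toy stand-in for a few columns of `ed` (values are made up)
toy <- data.frame(
  Gender_Cleaned  = c("Female", "", "Male", "Male"),
  Marital_Cleaned = c("Single", "", "", "Married / Cohabiting"),
  imd_score       = c(21.4, NA, 35.0, 28.9),
  stringsAsFactors = FALSE
)

# Count NA entries and blank ("") entries separately for each column
audit <- data.frame(
  n_na    = sapply(toy, function(x) sum(is.na(x))),
  n_blank = sapply(toy, function(x) sum(!is.na(x) & as.character(x) == ""))
)
audit
```

Keeping the two counts separate matters here because blanks and NAs are handled differently in the cleaning steps.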
Below is code to clean data based on table and map above
Cleaning the Gender variable
table(ed$Gender_Cleaned)
table(is.na(ed$Gender_Cleaned)) # no NA values, 2 blank ("") values.
# Female Male
# 2 23106 24473
#recoding blank gender values as NA
ed$Gender_Cleaned[ed$Gender_Cleaned == ""] <- NA
table(ed$Gender_Cleaned)
table(is.na(ed$Gender_Cleaned))
#identify where gender NA rows are
which(is.na(ed$Gender_Cleaned)) # row 682 and 30844
samples_to_remove <- which(is.na(ed$Gender_Cleaned)) # row 682 and 30844
#removing samples from original data "ed"
# overwriting ed with new ed.
ed <- ed[-samples_to_remove, ]
table(is.na(ed$Gender_Cleaned))
table(ed$Gender_Cleaned)
# Female Male
# 23106 24473
Cleaning the Marital Status variable
table(ed$Marital_Cleaned) # 3803 blank ("") values.
table(is.na(ed$Marital_Cleaned)) # no NA values
# replacing blank values in Marital_Cleaned variable with "Unknown"
ed$Marital_Cleaned[ed$Marital_Cleaned == ""] <- "Unknown"
table(ed$Marital_Cleaned) # New category name is "Unknown" = 3803
Cleaning the Diagnoses columns
# replacing NAs with 0 in the diagnosis columns
ed$Schizophrenia_Diag[is.na(ed$Schizophrenia_Diag)] <- 0
ed$SchizoAffective_Diag[is.na(ed$SchizoAffective_Diag)] <- 0
ed$Bipolar_Diag[is.na(ed$Bipolar_Diag)] <- 0
ed$Depressive_Diag[is.na(ed$Depressive_Diag)] <- 0
ed$Manic_Diag[is.na(ed$Manic_Diag)] <- 0
ed$SubAbuse_Diag[is.na(ed$SubAbuse_Diag)] <- 0
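The six assignments above repeat one pattern, so they could also be written as a single step over all flag columns. This is a sketch on a toy data frame (made-up values); it assumes that the flag columns are exactly those whose names end in `_Diag`, which should be checked against the real column names before use.

```r
# Toy stand-in for the diagnosis columns of `ed` (values are made up)
toy <- data.frame(
  Schizophrenia_Diag = c(1, NA, NA),
  Bipolar_Diag       = c(NA, 1, NA),
  Bipolar_Diag_Date  = as.Date(c(NA, "2009-03-01", NA))
)

# Select the flag columns only: names ending in "_Diag" (not "_Diag_Date")
diag_cols <- grep("_Diag$", names(toy), value = TRUE)

# Replace NA with 0 in every flag column; date columns are left untouched
toy[diag_cols] <- lapply(toy[diag_cols], function(x) replace(x, is.na(x), 0))
toy
```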
Renaming and deleting redundant variables
#renaming columns
ed <- ed %>% rename(TotalResidentsInLSOA = All_Usual_Residents,
WhiteBrit_EDPercent = G,
WhiteIrish_EDPercent = White_Irish_Percentage,
OtherWhite_EDPercent = White_Other_White_GypsyIrishTrav,
WhiteBlackCarib_EDPercent = P,
WhiteBlackAfri_EDPercent = R,
WhiteAsian_EDPercent = T,
OtherMixed_EDPercent = V,
BritIndian_EDPercent = Asian_Asian_British_Indian_Perce,
BritPakistani_EDPercent = Asian_Asian_British_Pakistani_Pe,
BritBangladeshi_EDPercent = AB,
BritChinese_EDPercent = Asian_Asian_British_Chinese_Perc,
OtherAsian_EDPercent = Asian_Asian_British_OtherAsian_P,
African_EDPercent = AH,
Caribbean_EDPercent = AJ,
OtherBlack_EDPercent = AL,
WhiteBrit_Residents = White_English_Welsh_Scottish_Nor,
TotalIrish_Residents = White_Irish_Count,
OtherWhite_Residents = White_Gypsy_Irish_Traveller_Coun,
MixedCaribbean_Residents = Mixed_Multiple_Ethnic_Groups_Whi,
MixedAsian_Residents = S,
OtherMixed_Residents = Mixed_Multiple_Ethnic_Groups_Oth,
BritIndian_Residents = Asian_Asian_British_Indian_Count,
BritPakistani_Residents = Asian_Asian_British_Pakistani_Co,
BritBangladeshi_Residents = Asian_Asian_British_Bangladeshi_,
BritChinese_Residents = Asian_Asian_British_Chinese_Coun,
OtherAsian_Residents = Asian_Asian_British_OtherAsian_C,
African_Residents = Black_African_Caribbean_BlackBr,
Caribbean_Residents = Black_African_Caribbean_Black_Br,
OtherBlack_Residents = AK,
OtherEthnicity_Residents = Other_Ethnic_Group_AnyOtherEthni)
#***************************************************************************
# Columns to delete
# C and D
# lsoa01
# ICD10_UnderlyingCause
# primary_diagnosis
# JunkID
dim(ed) # 47579 67
ed <- ed %>% select(-C, -D, -lsoa01, -ICD10_UnderlyingCause, -primary_diagnosis, -JunkID)
dim(ed) # 47579 61
Cleaned data saved as “Cleaned_Data_ED.Rdata”. This contains dataframe ed.
# save ed
save(ed, file = "Cleaned_Data_ED.Rdata", compress = TRUE)
Creating a new dataset called: edclean, which will be a copy of the original dataset ed and consist of new features.
List of new features added:
The main exposure variable: Ethnic Density Score (named “ethnicdensityscore”)
Ethnicity (named “ethnicity”)
Age variables (named “ageatdiagnosis”, “ageatdeath” and “agegroups”) - “Age at Diagnosis”, “Age at Death” and “Age Group”
Borough variable (named “LSOA_4boroughs”): Croydon, Lewisham, Southwark, Lambeth and Other
Cause of Death Variable (named “DeathBy”)
Below is the code used to generate the listed new features.
Code generating the Ethnic Density Score (named “ethnicdensityscore”)
- The main exposure variable of the dataset is Ethnic Density (ED) score.
Ethnic density is defined as the composition of each ethnic group residing
in a geographical area of a given size (usually a fairly large geographical
area, known as Lower Super Output Area (LSOA) which consists of around
1500 residents).
- In the original dataset `ed`, each patient was already assigned an
ethnic density for every ethnic group, based on their LSOA code. To
assign an own-group ethnic density score to each individual, the ethnic
density column matching the patient's ethnicity was selected and assigned
to that patient.
# Code generating an ethnic density score for each patient
edclean <- ed %>%
mutate(ethnicdensityscore =
ifelse(ethnicitycleaned == "British (A)", WhiteBrit_EDPercent,
ifelse(ethnicitycleaned == "African (N)", African_EDPercent,
ifelse(ethnicitycleaned == "Irish (B)", WhiteIrish_EDPercent,
ifelse(ethnicitycleaned == "Any other Asian background (L)", OtherAsian_EDPercent,
ifelse(ethnicitycleaned == "Any other black background (P)", OtherBlack_EDPercent,
ifelse(ethnicitycleaned == "Any other mixed background (G)", OtherMixed_EDPercent,
ifelse(ethnicitycleaned == "Any other white background (C)", OtherWhite_EDPercent,
ifelse(ethnicitycleaned == "Bangladeshi (K)", BritBangladeshi_EDPercent,
ifelse(ethnicitycleaned == "Caribbean (M)", Caribbean_EDPercent,
ifelse(ethnicitycleaned == "Chinese (R)", BritChinese_EDPercent,
ifelse(ethnicitycleaned == "Indian (H)", BritIndian_EDPercent,
ifelse(ethnicitycleaned == "Pakistani (J)", BritPakistani_EDPercent,
ifelse(ethnicitycleaned == "White and Asian (F)", WhiteAsian_EDPercent,
ifelse(ethnicitycleaned == "White and Black African (E)", WhiteBlackAfri_EDPercent,
ifelse(ethnicitycleaned == "White and black Caribbean (D)", WhiteBlackCarib_EDPercent,
# ethnicities without a matching column get a missing score
NA_real_)))))))))))))))) %>%
mutate(ethnicdensityscore = as.numeric(ethnicdensityscore))
# save ed with new feature into new dataset "edclean"
save(edclean, file = "Data_ED_new_features.Rdata", compress = TRUE)
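The same own-group selection can be expressed with a lookup table instead of nested ifelse() calls, which keeps the mapping from label to column in one place. This is a sketch only: it runs on a toy data frame with made-up values, shows three of the fifteen labels, and the names `ed_column_for` and `col_index` are hypothetical.

```r
# Lookup from ethnicity label to the column holding that group's density.
# Only three of the fifteen labels are shown here.
ed_column_for <- c(
  "British (A)" = "WhiteBrit_EDPercent",
  "African (N)" = "African_EDPercent",
  "Irish (B)"   = "WhiteIrish_EDPercent"
)

# Toy stand-in for `ed` (values are made up)
toy <- data.frame(
  ethnicitycleaned     = c("African (N)", "British (A)", "Irish (B)", "Not stated"),
  WhiteBrit_EDPercent  = c(40.1, 62.5, 55.0, 48.3),
  African_EDPercent    = c(18.2, 5.4, 7.7, 12.0),
  WhiteIrish_EDPercent = c(2.1, 1.8, 2.6, 2.0),
  stringsAsFactors = FALSE
)

# For each patient, find the position of the column matching their ethnicity
col_index <- match(ed_column_for[toy$ethnicitycleaned], names(toy))

# Pick that column's value from the patient's own row.
# Labels missing from the lookup give NA.
toy$ethnicdensityscore <- vapply(
  seq_len(nrow(toy)),
  function(i) if (is.na(col_index[i])) NA_real_ else toy[[col_index[i]]][i],
  numeric(1)
)
toy[, c("ethnicitycleaned", "ethnicdensityscore")]
```

Adding or correcting a mapping then means editing one entry of the lookup vector rather than a nested expression.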
Code generating the Ethnicity variable (named “ethnicity”)
Some of the ethnic groups are too small in number, and their respective
suicide numbers are also small (data not shown), which makes them difficult
to analyse and could potentially bias estimates towards ethnic groups that
have larger sizes. These groups will be aggregated while still maintaining
their own-group ethnic density scores.
Ethnic groups are grouped into larger ethnic categories (White, Other White,
Irish, Black, Other Black, Caribbean, Asian and Mixed Race).
See chunk feature_engineering_ethnicity in Capstone_Project_Draft.Rmd
for code.
Generating Age variables (named “ageatdiagnosis”, “ageatdeath” and “agegroups”)
The dataset only has pseudonymised date of birth variables
and age can be generated from them using date of diagnosis.
See Chunk "feature_engineering_create_age_variables" in
Capstone_Project_Draft.Rmd for code.
Summaries and counts provided below.
# summary(edclean$ageatdiagnosis)
# Age at Diagnosis Summary
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.00 30.00 40.00 42.57 52.00 104.00
# table(edclean$agegroups)
# Patient counts by age groups
# < 25 26-40 41-60 61-100
# 7726 16075 16532 7196
# summary(edclean$ageatdeath)
# Age at Death Summary
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 10.00 52.00 69.00 66.49 82.00 107.00 42222
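Since the chunk itself is not shown here, the following is only a sketch of how such age variables could be derived. It assumes `DOB_Cleaned` and `diagnosisdate` are stored as Date values, approximates a year as 365.25 days, and guesses the cut points behind the age bands; the original chunk may differ on all three points.

```r
# Toy stand-in for the date columns (values are made up)
toy <- data.frame(
  DOB_Cleaned   = as.Date(c("1980-01-01", "1990-06-15", "1940-03-01")),
  diagnosisdate = as.Date(c("2010-01-01", "2008-06-14", "2012-06-01"))
)

# Age in completed years, approximating a year as 365.25 days
days <- as.numeric(toy$diagnosisdate - toy$DOB_Cleaned)
toy$ageatdiagnosis <- floor(days / 365.25)

# Band the ages; the cut points are an assumption about the original bands
toy$agegroups <- cut(
  toy$ageatdiagnosis,
  breaks = c(-Inf, 25, 40, 60, Inf),
  labels = c("< 25", "26-40", "41-60", "61-100")
)
toy
```

Age at death would follow the same pattern with the date of death in place of the diagnosis date, leaving NA for patients who have not died.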
Code for generating Borough variable (named “LSOA_4boroughs”)
# tabulating patient LSOA code
# these need to be grouped into the main boroughs for this particular Trust
# table(edclean$LSOA_NAME)
# generating borough variable ("LSOA_4boroughs")
edclean <- edclean %>%
  mutate(LSOA_4boroughs = case_when(
    grepl("^Southw", LSOA_NAME) ~ "SOUTHWARK",
    grepl("^Croy", LSOA_NAME) ~ "CROYDON",
    grepl("^Lambe", LSOA_NAME) ~ "LAMBETH",
    grepl("^Lewish", LSOA_NAME) ~ "LEWISHAM",
    TRUE ~ "OTHER"))
x <- as.data.frame(table(edclean$LSOA_4boroughs))
kable(x)
# |Var1 | Freq|
# |:---------|-----:|
# |CROYDON | 10419|
# |LAMBETH | 9979|
# |LEWISHAM | 8135|
# |OTHER | 9941|
# |SOUTHWARK | 9104|
table(is.na(edclean$LSOA_4boroughs)) # no NA values
# save new feature into new dataset "edclean"
save(edclean, file = "Data_ED_new_features.Rdata", compress = TRUE)
Generating Cause of Death Variable (named “DeathBy”)
- Create flag variables for individuals:
i) who died by suicide
ii) who died of other causes
iii) who are not dead
- See code feature_engineering_cause_of_death in
Capstone_Project_Draft.Rmd for code.
- Count summary provided below.
table(edclean$DeathBy)
# NotDied OtherCause Suicide
# 42222 5045 262
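The three categories listed above could be derived along these lines. This is a sketch under the assumption that `Suicide` is a 0/1 flag and that `dateofdeath` is NA for patients who have not died; the values below are made up, and the referenced chunk may be implemented differently.

```r
# Toy stand-in for the outcome columns (values are made up)
toy <- data.frame(
  Suicide     = c(0, 1, 0, 0),
  dateofdeath = as.Date(c(NA, "2011-05-02", "2013-09-17", NA))
)

# Suicide takes priority; any other recorded death is "OtherCause"
toy$DeathBy <- ifelse(
  toy$Suicide == 1, "Suicide",
  ifelse(!is.na(toy$dateofdeath), "OtherCause", "NotDied")
)
table(toy$DeathBy)
```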
The edclean dataset contains the patient demographics, ethnic density score (main exposure variable), Suicide variable (main outcome variable), confounding variables and other variables.
The edclean dataset is the starting point for the rest of the analyses. The Data Dictionary for this dataset is provided below (see __Data Dictionary__).
# generating the final dataset
edclean <- edclean %>%
select(Gender_Cleaned, DOB_Cleaned, Marital_Cleaned,
diagnosisdate, ageatdiagnosis, Schizophrenia_Diag,
SchizoAffective_Diag,Depressive_Diag,SubAbuse_Diag,
Manic_Diag,Bipolar_Diag,ethnicitycleaned,ethnicity,
ethnicdensityscore,imd_score,dateofdeath,Suicide,
ageatdeath,agegroups,LSOA_4boroughs,LSOA11,DeathBy,
TotalResidentsInLSOA, WhiteBrit_EDPercent, WhiteIrish_EDPercent,
OtherWhite_EDPercent, WhiteBlackCarib_EDPercent,
WhiteBlackAfri_EDPercent, WhiteAsian_EDPercent,
OtherMixed_EDPercent, BritIndian_EDPercent,
BritPakistani_EDPercent, BritBangladeshi_EDPercent,
BritChinese_EDPercent, OtherAsian_EDPercent,
African_EDPercent, Caribbean_EDPercent,
OtherBlack_EDPercent)
# Convert the outcome to a factor. This is important before doing regression
# analysis (as is converting all the other categorical variables into factors).
edclean$Suicide <- as.factor(edclean$Suicide)
# save new feature into new dataset "edclean"
save(edclean, file = "Data_ED_new_features.Rdata", compress = TRUE)
This section describes basic exploratory data analysis of the outcome, Suicide, and the exposure variable (ethnic density score), and any interactions and associations with other available variables of interest (age, gender, marital status, area-level deprivation and borough).
Suicide and its unadjusted association with the variables of interest and the main exposure are first analysed. This is followed by exploring ethnic density score and its association with relevant variables.
Exploring Death by Suicide
The following tables show the distribution of deaths by Suicide (0=No, 1=Yes) across the demographic variables. A chi-square test is also conducted to test for an association between each factor and the outcome of Suicide.
Suicide and Gender
# descriptive_table_suicide_gender
x <- as.data.frame(table(edclean$Gender_Cleaned,edclean$Suicide)) %>% spread(Var2,Freq)
names(x) <- c("Gender","0","1")
xstat <- chisq.test(edclean$Suicide, edclean$Gender_Cleaned)
x$Chi <- ""
x$P <- ""
x[2,4] <- round(xstat$statistic, 3)
x[2,5] <- signif(xstat$p.value, 2)
kable(x)
| Gender | 0 | 1 | Chi | P |
|---|---|---|---|---|
| Female | 23003 | 80 | ||
| Male | 24264 | 182 | 33.57 | 6.9e-09 |
Suicide and Age Groups
a <- as.data.frame(table(edclean$agegroups, edclean$Suicide)) %>% spread(Var2, Freq)
names(a) <- c("Age Groups", "0", "1")
a$Chi <- ""
a$P <- ""
astat <- chisq.test(edclean$agegroups, edclean$Suicide)
a[4,4] <- round(astat$statistic, 3)
a[4,5] <- signif(astat$p.value, 2)
kable(a)
| Age Groups | 0 | 1 | Chi | P |
|---|---|---|---|---|
| < 25 | 7705 | 21 | ||
| 26-40 | 15977 | 98 | ||
| 41-60 | 16422 | 110 | ||
| 61-100 | 7163 | 33 | 17.06 | 0.00069 |
Suicide and Marital Status
b <- as.data.frame(table(edclean$Marital_Cleaned, edclean$Suicide)) %>% spread(Var2, Freq)
names(b) <- c("Marital Status", "0", "1")
b$Chi <- ""
b$P <- ""
bstat <- chisq.test(edclean$Marital_Cleaned, edclean$Suicide)
b[4,4] <- round(bstat$statistic, 3)
b[4,5] <- signif(bstat$p.value, 2)
kable(b)
| Marital Status | 0 | 1 | Chi | P |
|---|---|---|---|---|
| Divorced / Separated / Widowed | 7366 | 34 | ||
| Married / Cohabiting | 8345 | 46 | ||
| Single | 27790 | 156 | ||
| Unknown | 3766 | 26 | 2.413 | 0.49 |
Suicide and Ethnicity
c <- as.data.frame(table(edclean$ethnicity, edclean$Suicide)) %>% spread(Var2, Freq)
names(c) <- c("Ethnicity", "0", "1")
c$Chi <- ""
c$P <- ""
cstat <- chisq.test(edclean$ethnicity, edclean$Suicide)
c[8,4] <- round(cstat$statistic, 3)
c[8,5] <- signif(cstat$p.value, 2)
kable(c)
| Ethnicity | 0 | 1 | Chi | P |
|---|---|---|---|---|
| Asian | 2536 | 11 | ||
| Black | 3277 | 16 | ||
| Caribbean | 2742 | 13 | ||
| Irish | 1558 | 9 | ||
| Mixed Race | 1293 | 8 | ||
| Other Black | 3739 | 9 | ||
| Other White | 4365 | 20 | ||
| White | 27757 | 176 | 11.855 | 0.11 |
Suicide and Borough
d <- as.data.frame(table(edclean$LSOA_4boroughs, edclean$Suicide)) %>% spread(Var2, Freq)
names(d) <- c("Borough", "0", "1")
d$Chi <- ""
d$P <- ""
dstat <- chisq.test(edclean$LSOA_4boroughs, edclean$Suicide)
d[5,4] <- round(dstat$statistic, 3)
d[5,5] <- signif(dstat$p.value, 2)
kable(d)
| Borough | 0 | 1 | Chi | P |
|---|---|---|---|---|
| CROYDON | 10370 | 49 | ||
| LAMBETH | 9934 | 45 | ||
| LEWISHAM | 8090 | 45 | ||
| OTHER | 9810 | 82 | ||
| SOUTHWARK | 9063 | 41 | 18.684 | 0.00091 |
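The five tables above repeat the same cross-tabulate-and-test pattern, so it could be wrapped in a small helper. The function name `suicide_crosstab` is hypothetical; the check rebuilds the gender table from its printed counts rather than from the patient-level data.

```r
# Cross-tabulate a factor against the outcome and attach the chi-square test
suicide_crosstab <- function(factor_var, outcome) {
  counts <- table(factor_var, outcome)
  test   <- chisq.test(counts)
  list(
    counts    = counts,
    statistic = unname(round(test$statistic, 3)),
    p_value   = signif(test$p.value, 2)
  )
}

# Rebuild the gender table from its printed counts and re-run the test
cell_n  <- c(23003, 24264, 80, 182)
gender  <- rep(c("Female", "Male", "Female", "Male"), times = cell_n)
suicide <- rep(c(0, 0, 1, 1), times = cell_n)

res <- suicide_crosstab(gender, suicide)
res$counts
res$statistic
res$p_value
```

For a 2 x 2 table, chisq.test() applies Yates' continuity correction by default, which is why the gender statistic is slightly smaller than an uncorrected chi-square would be.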
Suicide and Deprivation
qplot(Suicide, imd_score, data = edclean, main = "Area-level Deprivation by Suicide") +
geom_boxplot() +
xlab("Suicide (1 = Yes, 0 = No)") +
ylab("Area-level Deprivation Score")
# Comparing mean deprivation score by Suicides vs Non-Suicides
edclean %>% select(imd_score, Suicide) %>% group_by(Suicide) %>% summarise(Mean = mean(imd_score), S.D = sd(imd_score), N = n()) %>% kable()
| Suicide | Mean | S.D | N |
|---|---|---|---|
| 0 | 29.42796 | 10.87067 | 47267 |
| 1 | 28.54211 | 11.34683 | 262 |
# t-test to check for significant differences in mean deprivation score
with(edclean, t.test(imd_score[Suicide == 0], imd_score[Suicide == 1], conf.level = 0.95, paired = FALSE))
##
## Welch Two Sample t-test
##
## data: imd_score[Suicide == 0] and imd_score[Suicide == 1]
## t = 1.2605, df = 263.66, p-value = 0.2086
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.4979416 2.2696497
## sample estimates:
## mean of x mean of y
## 29.42796 28.54211
Suicide and Ethnic Density Score
qplot(Suicide, ethnicdensityscore, data = edclean, main = "Ethnic Density Scores by Suicide") +
geom_boxplot() +
xlab("Suicide (1 = Yes, 0 = No)") +
ylab("Ethnic Density Score")
# Comparing mean ethnic density score by Suicides vs Non-Suicides
edclean %>% select(ethnicdensityscore, Suicide) %>%
group_by(Suicide) %>%
summarise(Mean = mean(ethnicdensityscore), S.D = sd(ethnicdensityscore), N = n()) %>%
kable()
| Suicide | Mean | S.D | N |
|---|---|---|---|
| 0 | 32.92411 | 25.90010 | 47267 |
| 1 | 38.20856 | 27.02586 | 262 |
T-test to check for significant differences in mean ethnic density score
# t-test to check for significant differences in mean ethnic density score
with(edclean, t.test(ethnicdensityscore[Suicide == 0], ethnicdensityscore[Suicide == 1], conf.level = 0.95, paired = FALSE))
##
## Welch Two Sample t-test
##
## data: ethnicdensityscore[Suicide == 0] and ethnicdensityscore[Suicide == 1]
## t = -3.157, df = 263.66, p-value = 0.001779
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8.580377 -1.988521
## sample estimates:
## mean of x mean of y
## 32.92411 38.20856
The mean ethnic density score differs significantly by Suicide. The mean ethnic density score among those who died by suicide is significantly higher than among those who did not die by suicide.
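The Welch statistics printed above can be checked directly from the group means, standard deviations and counts in the summary tables. The function below is a sketch of the standard Welch formulas (the name `welch_from_summary` is hypothetical); because the inputs are rounded summary values, agreement with the t.test() output is approximate.

```r
# Welch two-sample t-test computed from summary statistics
welch_from_summary <- function(m1, s1, n1, m2, s2, n2) {
  v1 <- s1^2 / n1                      # squared standard error, group 1
  v2 <- s2^2 / n2                      # squared standard error, group 2
  t_stat <- (m1 - m2) / sqrt(v1 + v2)
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  p  <- 2 * pt(-abs(t_stat), df)       # two-sided p-value
  c(t = t_stat, df = df, p = p)
}

# Deprivation score: non-suicides vs suicides (values from the table above)
dep <- welch_from_summary(29.42796, 10.87067, 47267, 28.54211, 11.34683, 262)

# Ethnic density score: non-suicides vs suicides
eds <- welch_from_summary(32.92411, 25.90010, 47267, 38.20856, 27.02586, 262)

round(rbind(deprivation = dep, ethnic_density = eds), 4)
```

Both comparisons give nearly the same degrees of freedom because the much smaller suicide group (N = 262) dominates the standard error in each case.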
Suicide and Ethnic Density Score by Ethnicity
Given the large proportion of the White ethnic group in the Suicide and Ethnicity table, and the fact that ethnic density scores are generated using ethnicity, it is worth exploring the ethnic density distribution by suicide within each ethnic group.
Below are boxplots of the ethnic density distributions by Suicide within each ethnic group.
| ethnicity | Suicide | Mean | S.D | N |
|---|---|---|---|---|
| Asian | 0 | 5.061238 | 5.7740780 | 2536 |
| Asian | 1 | 9.663636 | 13.8316501 | 11 |
| Black | 0 | 15.210070 | 8.9196143 | 3277 |
| Black | 1 | 16.650000 | 10.0434390 | 16 |
| Caribbean | 0 | 10.780671 | 4.9713853 | 2742 |
| Caribbean | 1 | 9.715385 | 5.2925807 | 13 |
| Irish | 0 | 2.079076 | 0.8917008 | 1558 |
| Irish | 1 | 1.711111 | 0.4456581 | 9 |
| Mixed Race | 0 | 2.185460 | 1.1687715 | 1293 |
| Mixed Race | 1 | 1.937500 | 1.2861210 | 8 |
| Other Black | 0 | 4.814228 | 2.2278974 | 3739 |
| Other Black | 1 | 6.366667 | 1.7979155 | 9 |
| Other White | 0 | 12.389223 | 5.2981810 | 4365 |
| Other White | 1 | 12.112080 | 7.5515498 | 20 |
| White | 0 | 49.927589 | 20.1629816 | 27757 |
| White | 1 | 52.165909 | 21.2804693 | 176 |
There seem to be some differences in ethnic density means by suicide within each ethnic group. The ethnic density distribution also differs between ethnic groups in terms of range (this is explored further below when exploring ethnic density score distributions).
Below is a table comparing mean ethnic density by Suicide within each ethnic group, using a t-test within each group.
| ethnicity | Suicide | Mean | S.D | N | T-test |
|---|---|---|---|---|---|
| Asian | 0 | 5.061238 | 5.7740780 | 2536 | t = -1.1032, df = 10.015, p-value = 0.2958 |
| Asian | 1 | 9.663636 | 13.8316501 | 11 | |
| Black | 0 | 15.210070 | 8.9196143 | 3277 | t = -0.57238, df = 15.116, p-value = 0.5755 |
| Black | 1 | 16.650000 | 10.0434390 | 16 | |
| Caribbean | 0 | 10.780671 | 4.9713853 | 2742 | t = 0.72421, df = 12.101, p-value = 0.4827 |
| Caribbean | 1 | 9.715385 | 5.2925807 | 13 | |
| Irish | 0 | 2.079076 | 0.8917008 | 1558 | t = 2.4488, df = 8.3743, p-value = 0.03872 |
| Irish | 1 | 1.711111 | 0.4456581 | 9 | |
| Mixed Race | 0 | 2.185460 | 1.1687715 | 1293 | t = 0.54392, df = 7.0717, p-value = 0.6032 |
| Mixed Race | 1 | 1.937500 | 1.2861210 | 8 | |
| Other Black | 0 | 4.814228 | 2.2278974 | 3739 | t = -2.5856, df = 8.0592, p-value = 0.03214 |
| Other Black | 1 | 6.366667 | 1.7979155 | 9 | |
| Other White | 0 | 12.389223 | 5.2981810 | 4365 | t = 0.16394, df = 19.086, p-value = 0.8715 |
| Other White | 1 | 12.112080 | 7.5515498 | 20 | |
| White | 0 | 49.927589 | 20.1629816 | 27757 | t = -1.3914, df = 177, p-value = 0.1658 |
| White | 1 | 52.165909 | 21.2804693 | 176 | |
Comparing the ethnic density score means by Suicide within different ethnic groups produces different results from comparing means in the entire cohort (t = -3.157, p-value = 0.001779). The table above shows that the mean ethnic density scores are significantly different in the Irish and Other Black ethnic groups, but there is no significant difference in the other ethnic groups.
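A within-group comparison like the one in the table above could be produced by splitting the data by ethnicity and running a Welch t-test in each subset. The sketch below uses simulated data with two groups only, so its numbers bear no relation to the cohort results; the object names are illustrative.

```r
# Simulated stand-in for edclean (values are random, two groups only)
set.seed(1)
toy <- data.frame(
  ethnicity          = rep(c("Irish", "Asian"), each = 40),
  Suicide            = rep(c(0, 1), times = 40),
  ethnicdensityscore = runif(80, min = 0, max = 20)
)

# Run a Welch t-test of score by Suicide within each ethnic group
by_group <- lapply(split(toy, toy$ethnicity), function(g) {
  tt <- t.test(ethnicdensityscore ~ Suicide, data = g)
  data.frame(
    t       = unname(tt$statistic),
    df      = unname(tt$parameter),
    p_value = tt$p.value
  )
})

# One row per ethnic group
results <- do.call(rbind, by_group)
results
```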
Exploring ethnic density distribution (main exposure)
Ethnic density distribution by Ethnicity
The plot above shows the square root (for less noise) of the ethnic density score distribution for each ethnic group. It is clear that ethnic minority groups have relatively smaller ethnic density ranges, and that relatively fewer individuals of these ethnicities are known to mental health services compared to the White ethnic group.
Here is a summary table comparing means by ethnic groups.
| ethnicity | MEAN | MEDIAN | SD | N | VARIANCE | QT25 | QT75 |
|---|---|---|---|---|---|---|---|
| White | 49.942 | 46.9 | 20.171 | 27933 | 406.869 | 34.700 | 63.000 |
| Other White | 12.388 | 12.0 | 5.309 | 4385 | 28.185 | 8.693 | 15.466 |
| Other Black | 4.818 | 4.9 | 2.228 | 3748 | 4.964 | 3.200 | 6.200 |
| Black | 15.217 | 13.3 | 8.924 | 3293 | 79.638 | 8.600 | 20.100 |
| Caribbean | 10.776 | 10.6 | 4.972 | 2755 | 24.721 | 7.300 | 14.100 |
| Asian | 5.081 | 3.3 | 5.834 | 2547 | 34.036 | 1.600 | 6.200 |
| Irish | 2.077 | 2.0 | 0.890 | 1567 | 0.792 | 1.500 | 2.600 |
| Mixed Race | 2.184 | 2.0 | 1.169 | 1301 | 1.367 | 1.300 | 2.900 |
The plot and tables above indicate that the ethnic density distribution differs by ethnicity.
The non-White British ethnic groups have an ethnic density distribution with a much narrower range than the White British ethnic density distribution. This potentially reflects under-representation of ethnic minority groups in mental health services in South East London.
The Irish, Mixed Race and Other Black ethnic groups have a limited range of ethnic density scores (all below 12%).
There seem to be different levels of ethnic density “exposure” depending on ethnicity. Whether these levels are representative of the ethnic density distributions for South East London cannot be determined.
The table below shows the results of the ANOVA test conducted to determine whether the differences in means are significant. The results show that mean ethnic density scores differ by ethnicity.
| | Degrees of Freedom | F value | p-value |
|---|---|---|---|
| ethnicity | 7 | 11372 | <0.001 |
| Residuals | 47521 | | |
Ethnic density distribution by Borough
The White ethnic group is the most represented group across the boroughs, as shown in the barplot below.
It also has the highest mean ethnic density score in every borough compared with the other ethnic groups, as shown in the table below.
The mean Ethnic Density Score by Ethnicity and Borough
| LSOA_4boroughs | Asian | Black | Caribbean | Irish | Mixed Race | Other Black | Other White | White | N |
|---|---|---|---|---|---|---|---|---|---|
| CROYDON | 6.874 | 11.172 | 12.757 | 1.554 | 2.380 | 4.937 | 7.502 | 49.160 | 10419 |
| LAMBETH | 2.008 | 14.194 | 10.887 | 2.488 | 2.436 | 5.455 | 16.279 | 39.526 | 9979 |
| LEWSIHAM | 3.606 | 14.134 | 12.357 | 1.910 | 2.587 | 4.909 | 10.577 | 41.840 | 8135 |
| OTHER | 7.195 | 10.352 | 4.296 | 1.803 | 1.270 | 2.155 | 11.503 | 67.691 | 9892 |
| SOUTHWARK | 2.367 | 21.170 | 8.435 | 2.266 | 1.916 | 5.026 | 11.751 | 41.263 | 9104 |
In the plot below, the ethnic minority groups all have mean scores below 50% in every borough, while the White ethnic density scores span the full range (0 - 100%; the mean score ranges from ~40% to ~68%).
The ANOVA test shows that mean ethnic density scores differ significantly by borough and by ethnicity, and that the borough-by-ethnicity interaction is also significant.
lm_edscore_by_borough_and_ethnicity <- lm(ethnicdensityscore ~ as.factor(LSOA_4boroughs)*as.factor(ethnicity) , data = edclean)
anova(lm_edscore_by_borough_and_ethnicity)
## Analysis of Variance Table
##
## Response: ethnicdensityscore
## Df Sum Sq Mean Sq
## as.factor(LSOA_4boroughs) 4 5294953 1323738
## as.factor(ethnicity) 7 16970520 2424360
## as.factor(LSOA_4boroughs):as.factor(ethnicity) 28 1327188 47400
## Residuals 47489 8311998 175
## F value Pr(>F)
## as.factor(LSOA_4boroughs) 7562.92 < 2.2e-16 ***
## as.factor(ethnicity) 13851.11 < 2.2e-16 ***
## as.factor(LSOA_4boroughs):as.factor(ethnicity) 270.81 < 2.2e-16 ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Summary: Exploring Ethnic Density
There may be under-representation of non-White British ethnic groups in this cohort. To answer our question “Does ethnic density predict suicide?”, analysing ethnic minority groups may therefore not provide an unbiased answer. The final analysis could be conducted separately within each ethnic group or only within the White ethnic group (see the table below for counts by Suicide and ethnic group).
|Ethnicity | Sui:No| Sui:Yes|Chi |P |
|:-----------|-----: |---: |:------|:----|
|Asian | 2536 | 11 | | |
|Black | 3277 | 16 | | |
|Caribbean | 2742 | 13 | | |
|Irish | 1558 | 9 | | |
|Mixed Race | 1293 | 8 | | |
|Other Black | 3739 | 9 | | |
|Other White | 4365 | 20 | | |
|White | 27757 | 176 |11.855 |0.11 |
Before starting the analysis, another part of the research related to ethnic density is introduced here. This part will undergo the same feature engineering process and will be explored and analysed later.
Introduction to Trust Ethnic Density and how it differs from the original ethnic density score measure
In discussion with my mentor, we came up with another potential research question: “How does each patient’s ethnic density in the community (population ethnic density) compare to their ethnic density within the Trust (trust ethnic density)?” In other words, can we predict ethnic density within the Trust from patients’ population ethnic density?
Defining Trust Ethnic Density
Like the original ethnic density score, which is defined as the composition of each ethnic group residing in a geographical area of a given size, the trust ethnic density is the percentage composition of each ethnic group among the patients who reside in the same LSOA code and have been referred to the Trust. Comparing the original ethnic density score (referred to from here on as the population ethnic density score) with the trust ethnic density score can give an idea of whether being referred to mental health services can be explained in part by one’s population ethnic density.
What follows below is code for additional feature engineering and exploration of trust ethnic density (outcome) in the context of ethnic density (main exposure).
Additional Feature Engineering
Three new variables were generated - “LSOAsize”, “trust.ed” and “ratio”
Description of the new additional variables
| Variable | Description |
|---|---|
| LSOAsize | The total number of patients within a given LSOA code |
| trust.ed | The percentage of patients within a given LSOA code who belong to the patient's own ethnic group |
| ratio | trust.ed divided by the population ethnic density score variable (ethnicdensityscore). This ratio indicates whether an individual's ethnic density within the Trust is proportionate to their ethnic density in the community |
# dplyr provides %>%, select(), group_by(), mutate(), n() and distinct()
library(dplyr)
load("Data_ED_new_features.Rdata")
LSOAethnicdensity <- edclean %>%
  dplyr::select(ethnicdensityscore, ethnicity, imd_score,
                ageatdiagnosis, LSOA11, ethnicitycleaned,
                LSOA_4boroughs, Suicide, DeathBy, Gender_Cleaned,
                Marital_Cleaned, WhiteBrit_EDPercent, OtherWhite_EDPercent,
                African_EDPercent) %>%
  # ethcount: number of patients of the same ethnicity within each LSOA
  group_by(LSOA11, ethnicity) %>%
  mutate(ethcount = n()) %>%
  # LSOAsize: total number of patients within each LSOA
  group_by(LSOA11) %>%
  mutate(LSOAsize = n(),
         trust.ed = ((ethcount/LSOAsize)*100),
         ratio = trust.ed/ethnicdensityscore) %>%
  ungroup() %>%
  # drop rows that are exact duplicates across all selected columns
  distinct() %>%
  # set the reference level of each factor (first level listed)
  mutate(Gender_Cleaned=factor(Gender_Cleaned, levels=c("Male","Female")),
         Marital_Cleaned=factor(Marital_Cleaned,
                                levels=c("Unknown","Single","Married / Cohabiting","Divorced / Separated / Widowed")),
         LSOA_4boroughs=factor(LSOA_4boroughs,
                               levels=c("OTHER","CROYDON","SOUTHWARK","LEWSIHAM","LAMBETH"))) %>%
  mutate(Suicide = ifelse(Suicide == 0, "No", "Yes"))
LSOAethnicdensity$Suicide <- as.factor(LSOAethnicdensity$Suicide)
save(LSOAethnicdensity, file="LSOAethnicdensity.Rdata")
A brief look at the data to demonstrate how “ratio” links “ethnicdensityscore” and “trust.ed”.
| ethnicity | LSOAsize | ethnicdensityscore | trust.ed | ratio |
|---|---|---|---|---|
| White | 3 | 60.400000 | 66.66667 | 1.103753 |
| Other White | 3 | 15.230312 | 33.33333 | 2.188618 |
| White | 3 | 60.400000 | 66.66667 | 1.103753 |
| Black | 1 | 8.900000 | 100.00000 | 11.235955 |
| White | 2 | 15.400000 | 100.00000 | 6.493506 |
| White | 2 | 15.400000 | 100.00000 | 6.493506 |
| White | 1 | 54.300000 | 100.00000 | 1.841621 |
| Asian | 1 | 7.600000 | 100.00000 | 13.157895 |
| White | 1 | 54.800000 | 100.00000 | 1.824817 |
| White | 1 | 44.400000 | 100.00000 | 2.252252 |
| White | 2 | 48.100000 | 50.00000 | 1.039501 |
| Black | 2 | 16.400000 | 50.00000 | 3.048780 |
| Other White | 1 | 7.147297 | 100.00000 | 13.991304 |
| White | 1 | 50.200000 | 100.00000 | 1.992032 |
| White | 1 | 50.800000 | 100.00000 | 1.968504 |
Where ethnicdensityscore and trust.ed are roughly equal (i.e. the population ethnic density matches the trust ethnic density), the ratio is around 1.
Where ethnicdensityscore is low but trust.ed is high, the ratio is well above 1: it is the factor by which trust.ed exceeds ethnicdensityscore (for example, a trust.ed of 100 against an ethnicdensityscore of 8.9 gives a ratio of about 11.2).
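As a worked illustration of how the three variables relate, the sketch below recomputes them in base R for four made-up patients. The LSOA labels "A" and "B" are hypothetical; the ethnic density scores are borrowed from rows of the table above so that the resulting ratios can be compared with it.

```r
# Four hypothetical patients in two hypothetical LSOAs
toy <- data.frame(
  LSOA11 = c("A", "A", "A", "B"),
  ethnicity = c("White", "White", "Other White", "Black"),
  ethnicdensityscore = c(60.4, 60.4, 15.230312, 8.9)
)

# ethcount: patients of the same ethnicity within the same LSOA
toy$ethcount <- ave(seq_along(toy$LSOA11), toy$LSOA11, toy$ethnicity,
                    FUN = length)
# LSOAsize: all patients within the same LSOA
toy$LSOAsize <- ave(seq_along(toy$LSOA11), toy$LSOA11, FUN = length)

# trust.ed as a percentage, and its ratio to the population score
toy$trust.ed <- toy$ethcount / toy$LSOAsize * 100
toy$ratio    <- toy$trust.ed / toy$ethnicdensityscore
print(toy)
```

The two White patients in LSOA "A" get a trust.ed of 66.67 and a ratio of about 1.10, while the single patient in LSOA "B" gets a trust.ed of 100 and a ratio of about 11.2. This shows how very small LSOA sizes mechanically produce large ratios.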
Correlation of Trust ED and Population ED
The graph below shows a strong positive correlation (r = 0.85) between overall Trust Ethnic Density (trust.ed) and Population Ethnic Density (ethnicdensityscore); the correlation test follows the plot. The correlation differs by ethnicity, as can be seen from the differently coloured dots (representing different ethnic groups) in the plot below:
cor.test(LSOAethnicdensity$ethnicdensityscore, LSOAethnicdensity$trust.ed)
##
## Pearson's product-moment correlation
##
## data: LSOAethnicdensity$ethnicdensityscore and LSOAethnicdensity$trust.ed
## t = 338.42, df = 45849, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8424203 0.8476540
## sample estimates:
## cor
## 0.8450574
Trust ethnic density versus Population Ethnic Density Facetted by Ethnicity (see below for corresponding correlations)
The plots below show that the correlation between trust ethnic density and population ethnic density differs by ethnic group.
Within the White British ethnic group, there is a clear positive correlation between patients’ trust ethnic density scores and community ethnic density scores.
For the rest of the ethnic groups, the correlation is much weaker or close to zero (for the Irish, Mixed Race and Other Black ethnic groups this could be because of the restricted range of ethnic density scores).
Table of correlations
| Ethnicity | Pearson’s correlation value, p-value |
|---|---|
| Asian | 0.35, < 2.2e-16 |
| Black | -0.045, 0.009 |
| Caribbean | -0.046, 0.01523 |
| Irish | 0.01658627, 0.5127 |
| Mixed Race | -0.2418812, < 2.2e-16 |
| Other Black | -0.1521063, < 2.2e-16 |
| Other White | 0.176316, < 2.2e-16 |
| White | 0.76, < 2.2e-16 |
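A table like the one above can be produced by running `cor.test` within each ethnic group. The sketch below does this on simulated data: the group labels and values are made up, and trust.ed is generated as a noisy copy of the population score purely so that the example runs. Only the column names follow this analysis.

```r
set.seed(42)
n <- 300
# Simulated stand-in: three groups with a population score on a 0-100 scale
toy <- data.frame(
  ethnicity = rep(c("GroupA", "GroupB", "GroupC"), each = n / 3),
  ethnicdensityscore = runif(n, 0, 100)
)
# Simulated trust.ed: the population score plus random noise
toy$trust.ed <- toy$ethnicdensityscore + rnorm(n, sd = 20)

# Pearson correlation and p-value within each ethnic group
by_group <- lapply(split(toy, toy$ethnicity), function(d) {
  ct <- cor.test(d$ethnicdensityscore, d$trust.ed)
  data.frame(r = unname(ct$estimate), p.value = ct$p.value)
})
cor_table <- do.call(rbind, by_group)
print(round(cor_table, 3))
```

In the real data, a narrow range of ethnicdensityscore within a group will tend to attenuate the correlation, which is one reason to read the per-group values alongside the score ranges.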
Ratio versus Population Ethnic Density
The ratio can be plotted against the population ethnic density score (ethnicdensityscore) to show how the trust ethnic density score varies relative to the population ED.
Interpreting the Ratio variable in the plot above:
- The horizontal (red) line of 1 represents an optimal representation of
Ethnic Density (ED) in both the Trust and in the Community. That is
to say, if a patient had a 50% ethnic density in the community, they
also have a 50% or approximately 50% ethnic density within the Trust.
- The closer the RATIO value is to 1 the more equal the ratio of Trust
Ethnic Density to Population Ethnic Density is.
- The yellow trend line is drawn with geom_smooth, which here uses a
generalised additive model (gam) with integrated smoothness estimation.
From the plot, there seems to be a recurring pattern in all ethnic groups. Most patients in each ethnic group have a proportionate representation (i.e. “ratio” is 1 or close to 1). However, patients (regardless of ethnic group) living in areas of less than 5% ethnic density (i.e. their own ethnic group makes up fewer than 5% of residents in their area) tend to have a very high trust ED to community ED ratio (i.e. they are the only people from their LSOA code known to mental health services).
There are some points to consider before analysis:
Ethnic density distributions vary by ethnicity, with all the ethnic minority groups living in areas where their own ethnic group makes up less than 50% of residents. This could be the true ethnic density range of these ethnic minority groups.
The White ethnic group, however, is “exposed” to the full range of the ethnic density distribution (i.e. 0% - 100%), and hence I examine completed suicides in this group only from here on.
The distribution of ethnic density differs by borough within each ethnic group.
Suicide is extremely rare and hence the data here are highly unbalanced. Final analysis will have to take this into consideration.
Borough, age groups and gender have some association with death by suicide.
The exploratory analysis revealed a strong positive overall correlation between population ethnic density and trust ethnic density (r = 0.85), although the strength and even the sign of the correlation vary by ethnic group. A second piece of analysis can be conducted to investigate the association of trust ethnic density and population ethnic density.
With this in mind, the analysis will be conducted in two parts:
The first analysis will aim to answer: Can completed suicides be predicted by the ethnic density scores?
Outcome: Suicide, coded as 0 (Not Died by Suicide) or 1 (Died by Suicide)
Main exposure: ethnicdensityscore
The second analysis will answer: Can we predict trust/sample ethnic density using population ethnic density scores?
Outcome: trust.ed (scores)
Main exposure: ethnicdensityscore (scores)
For the first analysis, the association of ethnic density and completed suicide will be investigated in the White ethnic group. The White British ethnic group forms a large proportion of the entire cohort and is the most represented across boroughs. Exploring ethnic density and suicide in this group allows the “full effect” of ethnic density on deaths by suicide to be examined in a psychiatric healthcare setting.
For the second analysis, the association of population ethnic density and trust ethnic density will be investigated (see the note at the start of that analysis for the subset of the cohort used).
Ethnic Density and Suicide: Logistic regression
Since Suicide is a binary outcome (1 or 0), a logistic regression will be conducted to predict deaths by suicide from patients' ethnic density scores.
Model 1: Unadjusted analysis - Suicide and Ethnic density score only
##
## Call:
## glm(formula = Suicide ~ ethnicdensityscore, family = "binomial",
## data = dataset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.1279 -0.1160 -0.1111 -0.1076 3.2424
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.335226 0.205679 -25.940 <2e-16 ***
## ethnicdensityscore 0.005378 0.003668 1.466 0.143
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2134.5 on 27932 degrees of freedom
## Residual deviance: 2132.4 on 27931 degrees of freedom
## AIC: 2136.4
##
## Number of Fisher Scoring iterations: 8
Here are the exponentiated estimates (odds ratios) and the anova of the model, which tests whether adding the ethnic density score significantly reduces the deviance.
exp(logistic_model_base$coefficients)
# (Intercept) ethnicdensityscore
# 0.004818821 1.005392132
## anova
anova(logistic_model_base, test="Chisq")
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 27932 2134.5
## ethnicdensityscore 1 2.1205 27931 2132.4 0.1453
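As a worked reading of these estimates (arithmetic on the printed coefficient only, not additional model output): the coefficient is a log odds ratio per one-point increase in the ethnic density score, so exponentiating it gives the odds ratio for one point, and multiplying it first gives the odds ratio for a larger change. The 10-point change is an illustrative choice.

$$
\mathrm{OR}_{1} = e^{0.005378} \approx 1.0054, \qquad \mathrm{OR}_{10} = e^{10 \times 0.005378} = e^{0.05378} \approx 1.055
$$

That is, a 10-percentage-point higher own-group ethnic density corresponds to roughly 5.5% higher odds of suicide in the unadjusted model. This estimate is not statistically distinguishable from no effect (p = 0.143).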
Model 2: Suicide and Ethnic density, with the other variables of interest
Results from the logistic regression and corresponding anova test.
##
## Call:
## glm(formula = Suicide ~ Gender_Cleaned + ageatdiagnosis + Marital_Cleaned +
## imd_score + LSOA_4boroughs + ethnicdensityscore, family = "binomial",
## data = dataset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.2051 -0.1236 -0.1095 -0.0866 3.5356
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -3.989637 0.585128 -6.818
## Gender_CleanedFemale -0.665341 0.165314 -4.025
## ageatdiagnosis 0.005972 0.004751 1.257
## Marital_CleanedSingle -0.169396 0.256716 -0.660
## Marital_CleanedMarried / Cohabiting -0.337914 0.306962 -1.101
## Marital_CleanedDivorced / Separated / Widowed -0.522282 0.333981 -1.564
## imd_score -0.009147 0.008106 -1.129
## LSOA_4boroughsCROYDON -0.665546 0.236345 -2.816
## LSOA_4boroughsSOUTHWARK -0.537767 0.254998 -2.109
## LSOA_4boroughsLEWSIHAM -0.459912 0.257400 -1.787
## LSOA_4boroughsLAMBETH -0.635262 0.269999 -2.353
## ethnicdensityscore -0.003931 0.005165 -0.761
## Pr(>|z|)
## (Intercept) 9.21e-12 ***
## Gender_CleanedFemale 5.70e-05 ***
## ageatdiagnosis 0.20882
## Marital_CleanedSingle 0.50935
## Marital_CleanedMarried / Cohabiting 0.27097
## Marital_CleanedDivorced / Separated / Widowed 0.11786
## imd_score 0.25910
## LSOA_4boroughsCROYDON 0.00486 **
## LSOA_4boroughsSOUTHWARK 0.03495 *
## LSOA_4boroughsLEWSIHAM 0.07398 .
## LSOA_4boroughsLAMBETH 0.01863 *
## ethnicdensityscore 0.44667
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2134.5 on 27932 degrees of freedom
## Residual deviance: 2098.4 on 27921 degrees of freedom
## AIC: 2122.4
##
## Number of Fisher Scoring iterations: 8
The exponentiated estimate (odds ratio) for ethnic density, obtained with exp(logistic_model$coefficients), is ~1 (0.996).
## anova
anova(logistic_model, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Suicide
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 27932 2134.5
## Gender_Cleaned 1 20.0083 27931 2114.5 7.711e-06 ***
## ageatdiagnosis 1 0.1469 27930 2114.3 0.70153
## Marital_Cleaned 3 2.9622 27927 2111.4 0.39749
## imd_score 1 2.0390 27926 2109.3 0.15331
## LSOA_4boroughs 4 10.3269 27922 2099.0 0.03527 *
## ethnicdensityscore 1 0.5770 27921 2098.4 0.44749
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The fully adjusted model shows no association of ethnic density with completed suicides. There is, however, an association of gender and of borough with suicide. Females are at decreased risk of suicide compared with males. Compared to the “Other” borough, individuals living in Lambeth, Croydon and Southwark are at a decreased risk of completing suicide.
The forest plot below shows the corresponding odds ratios from the regression analysis above, clearly showing the reduced odds of suicide in females and in the boroughs mentioned.
Comparison of the base model (Model 1) and full model (Model 2)
# ANOVA
anova(logistic_model_base, logistic_model, test ="Chisq")
## Analysis of Deviance Table
##
## Model 1: Suicide ~ ethnicdensityscore
## Model 2: Suicide ~ Gender_Cleaned + ageatdiagnosis + Marital_Cleaned +
## imd_score + LSOA_4boroughs + ethnicdensityscore
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 27931 2132.4
## 2 27921 2098.4 10 33.94 0.0001891 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There is a significant difference between the two models. The full model is possibly the better model as it adjusts for other potential confounding variables.
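The p-value in the model comparison comes from a likelihood-ratio (deviance) test: the drop in residual deviance between the two nested models is compared with a chi-squared distribution whose degrees of freedom equal the number of added parameters. A minimal sketch, using only the figures printed in the analysis of deviance table:

```r
# Figures copied from the analysis of deviance table
deviance_drop <- 33.94  # residual deviance of Model 1 minus that of Model 2
df_added      <- 10     # 27931 - 27921 additional parameters in Model 2

# Upper-tail chi-squared probability of a deviance drop at least this large
p_value <- pchisq(deviance_drop, df = df_added, lower.tail = FALSE)
print(p_value)
```

A small p-value here says only that the added covariates jointly improve the fit; it says nothing about the ethnic density score itself, which is present in both models.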
Selecting the better model
The pseudo R^2 statistic is defined below; it allows models to be compared using an “R^2”-like measure for logistic regression.
"Unlike linear regression with ordinary least squares estimation, there is no
R2 statistic which explains the proportion of variance in the dependent variable
that is explained by the predictors. However, there are a number of pseudo R2
metrics that could be of value. Most notable is McFadden’s R2, which is defined as
1−[ln(LM)/ln(L0)] where ln(LM) is the log likelihood value for the fitted model
and ln(L0) is the log likelihood for the null model with only an intercept as a
predictor. The measure ranges from 0 to just under 1, with values closer to zero
indicating that the model has no predictive power."
Results from the Pseudo R^2 test
| | Base_Model_pseudo_R2 | Full_Model_pseudo_R2 |
|---|---|---|
| McFadden | 0.0009934 | 0.016894 |
The table shows that both models have little predictive power, as both pseudo R^2 values are close to 0. The fully adjusted model performs better than the base model (0.017 versus 0.001, respectively).
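McFadden's pseudo R^2 can be approximated directly from the deviances printed in the two glm summaries. The sketch below assumes a binary outcome with one row per patient, in which case the deviance equals -2 times the log likelihood and the formula reduces to one minus the ratio of residual to null deviance. Because the printed deviances are rounded, the results differ slightly from the table (0.0009934 and 0.016894).

```r
# Deviances copied from the glm summaries
null_deviance <- 2134.5  # intercept-only model
base_deviance <- 2132.4  # Model 1: ethnic density score only
full_deviance <- 2098.4  # Model 2: fully adjusted

# McFadden's pseudo R^2 for ungrouped binary data:
# 1 - logLik(model)/logLik(null) = 1 - residual deviance / null deviance
mcfadden <- function(resid_dev, null_dev) 1 - resid_dev / null_dev

pseudo_r2 <- c(base = mcfadden(base_deviance, null_deviance),
               full = mcfadden(full_deviance, null_deviance))
print(round(pseudo_r2, 4))
```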
Conclusion
Given all other predictor variables, ethnic density is not associated with deaths by suicide.
Predictive Modelling
Even though the logistic regression above suggests that ethnic density is a weak predictor of death by suicide in mental health, there is a glaring issue with this analysis: it does not take into account that suicide is a rare event (262 suicides out of ~47K observations in the whole cohort; 176 out of 27,933 in the White group analysed here). The data are unbalanced. Perhaps ethnic density could predict deaths by suicide better if the data were balanced.
In addition, although the EDA strongly suggests that there is no association of ethnic density with suicide, a classification model was built as an exercise in predictive modelling and to formally assess the ability of the ethnic density score and the other variables to classify completed suicides. A generalised linear model was fitted using the R package caret. SMOTE is used to balance the data (http://search.r-project.org/library/performanceEstimation/html/smote.html). Model performance was assessed using area under the curve, sensitivity and specificity.
Model Building in Caret
The code below uses functions (trainControl and train) in the caret package to do the following:
It uses the full model (as.factor(Suicide) ~ ethnicdensityscore + Gender_Cleaned + ageatdiagnosis + Marital_Cleaned + imd_score + LSOA_4boroughs) to train a glm on data balanced using SMOTE.
The training is set up with caret's trainControl and train functions, using repeated cross-validation (repeatedcv). The 5-fold cross-validation (4 folds for training, 1 fold for testing) is repeated 20 times, giving 100 resamples, and the performance of each fitted model in predicting its hold-out (testing) fold is measured using the selected performance metric.
The metric used to assess how well this model predicts completed Suicide is the area under the Receiver Operating Characteristic (ROC) curve.
Here is the code that defines how to train the model
# small numbers in Suicide == YES class so not splitting into train and test
# resampling approach used instead
# trainControl: set training sampling and tuning parameters
# k-fold cv: 5 fold, repeated 20 times = 100 sample sets
# data not balanced so using SMOTE
control_smote_2class <- trainControl(method = "repeatedcv",
number = 5,
repeats = 20,
sampling = "smote",
summaryFunction = twoClassSummary,
returnResamp="all",
classProbs = TRUE,
savePredictions = "all",
returnData=TRUE)
Here is the code that builds a predictive model
# building the model: glm, binomial, select on best metric using "ROC" curve
mod_fit_smote_suicide <- train(as.factor(Suicide) ~ ethnicdensityscore + Gender_Cleaned + ageatdiagnosis + Marital_Cleaned + imd_score + LSOA_4boroughs,
data = edclean.vars.white,
method = "glm",
family="binomial",
trControl = control_smote_2class,
#tuneLength = 5,
metric = "ROC")
Performance of the model in Predicting Suicides
# the predictive summary can be given by printing the below code.
mod_fit_smote_suicide
## Generalized Linear Model
##
## 27933 samples
## 6 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 20 times)
## Summary of sample sizes: 22347, 22345, 22347, 22347, 22346, 22346, ...
## Addtional sampling using SMOTE
##
## Resampling results:
##
## ROC Sens Spec
## 0.5837908 0.7661935 0.3377857
##
##
# summary of model
# summary(mod_fit_smote_suicide)
# As in the unbalanced analysis, the analysis using the balanced model also tells us that ethnic density scores are not predictive of completed suicides, given the predictors. Gender, age and borough are associated with death by suicide. Compared to males, females are protected against suicide. With every unit increase in age, the risk of dying by suicide increases. Compared to the "OTHER" borough, all other boroughs are at lower risk of death by suicide.
ROC Curve
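# roc curve; rocCurve is not defined in the code shown, so it is rebuilt here from the Call printed below (assumes the pROC package)
pred <- predict(mod_fit_smote_suicide, type = "prob")
rocCurve <- pROC::roc(response = edclean.vars.white$Suicide, predictor = pred[, "Yes"])
plot(rocCurve, legacy.axes=TRUE)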
##
## Call:
## roc.default(response = edclean.vars.white$Suicide, predictor = pred[, "Yes"])
##
## Data: pred[, "Yes"] in 27757 controls (edclean.vars.white$Suicide No) < 176 cases (edclean.vars.white$Suicide Yes).
## Area under the curve: 0.6212
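In terms of how well the model predicts suicide, the cross-validated AUC is 0.58, and the AUC is 0.62 when the final model is scored on the same data it was trained on, as in the ROC curve above. Both values are close to 0.5 (chance), so the predictive value of the model is poor. In the resampling results, caret's twoClassSummary treats the first factor level (“No”) as the event, so “Sens” (0.77) is the proportion of non-suicides correctly classified and “Spec” (0.34) is the proportion of suicides correctly identified. The plot above shows how poorly the model predicts completed suicides.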
Results from the predict function
# Building a confusion matrix
pred <- predict(mod_fit_smote_suicide)
confusionMatrix(pred, reference=edclean.vars.white$Suicide, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 21192 111
## Yes 6565 65
##
## Accuracy : 0.761
## 95% CI : (0.756, 0.766)
## No Information Rate : 0.9937
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0069
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.369318
## Specificity : 0.763483
## Pos Pred Value : 0.009804
## Neg Pred Value : 0.994789
## Prevalence : 0.006301
## Detection Rate : 0.002327
## Detection Prevalence : 0.237354
## Balanced Accuracy : 0.566401
##
## 'Positive' Class : Yes
##
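The confusion matrix shows how many times the model correctly predicted suicide (65 of the 176 suicides, a sensitivity of 0.37) and provides other performance metrics. Of the 6,630 patients predicted as “Yes”, only 65 died by suicide, a positive predictive value of about 1%.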
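The headline metrics reported by confusionMatrix can be recomputed by hand from the four cell counts, which makes it clearer what each one measures. The sketch below copies the counts from the matrix above, with "Yes" as the positive class.

```r
# Counts copied from the confusion matrix (positive class = "Yes")
tn <- 21192  # predicted No,  actually No
fn <- 111    # predicted No,  actually Yes
fp <- 6565   # predicted Yes, actually No
tp <- 65     # predicted Yes, actually Yes

metrics <- c(
  accuracy    = (tp + tn) / (tp + tn + fp + fn),
  sensitivity = tp / (tp + fn),  # share of suicides that were flagged
  specificity = tn / (tn + fp),  # share of non-suicides left unflagged
  ppv         = tp / (tp + fp),  # share of flagged patients who died by suicide
  npv         = tn / (tn + fn)
)
# Balanced accuracy is the mean of sensitivity and specificity
metrics["balanced_accuracy"] <- mean(metrics[c("sensitivity", "specificity")])
print(round(metrics, 4))
```

The accuracy of 0.76 is well below the no-information rate of 0.99 that would be achieved by predicting “No” for everyone, which is why accuracy alone is a poor guide for such a rare outcome.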
Summary
Conclusion: Part 1 of Data Analysis
The results suggest little clinical utility of the model (and of ethnic density) for suicide prediction.
Population Ethnic Density and Trust Ethnic Density: Linear regression
This second part will answer: Can we predict trust/sample ethnic density and ratio from population ethnic density scores?
Main exposure: ethnicdensityscore (scores)
Note: this analysis will be conducted among the White ethnic group and where the “LSOAsize” (for the definition see the table “Description of the new additional variables”) is above 19. I had initially conducted the analysis on the entire cohort, but this introduces patterns in the residual plots that disappear when LSOAsize is above 10 and when analysing by ethnic group.
Plots of age, deprivation, trust ethnic density and population ethnic density score
The plot above shows the correlations between “trust.ed”, “ethnicdensityscore”, “imd_score” and “ageatdiagnosis”. There is no correlation of age with the other variables. There is a positive correlation between “trust.ed” and “ethnicdensityscore”. There is a negative correlation between deprivation scores (“imd_score”) and both “trust.ed” and “ethnicdensityscore”.
Plotting Interaction Trees
par(mfrow = c(1,1))
library(tree)
model <- tree(trust.ed ~ ethnicdensityscore + ageatdiagnosis + Gender_Cleaned + LSOA_4boroughs + Marital_Cleaned + imd_score, data = subset(LSOAethnicdensity, LSOAsize > 19 & ethnicity == "White"))
plot(model)
text(model)
The interaction tree shows that borough could potentially be interacting with ethnic density score.
Model 1: Trust ethnic density (outcome)
From the EDA, the tree plot and the pairs plot above, the following model was built, which includes interactions between population ethnic density score, borough and deprivation score.
“Linear regression: trust.ed ~ ethnicdensityscore * LSOA_4boroughs * imd_score + ageatdiagnosis + Gender_Cleaned + Marital_Cleaned”
# Model 1
# Full model: Trust Ethnic Density and all predictors
# Linear regression: trust.ed ~ ethnicdensityscore*LSOA_4boroughs*imd_score + ageatdiagnosis + Gender_Cleaned + Marital_Cleaned
# The tree model suggests interactions between ethnic density score and borough.
# The literature suggests a negative correlation of ethnic density and deprivation in the White or host ethnic group.
linear.model.age.gender.demog <- lm(trust.ed ~ ethnicdensityscore*LSOA_4boroughs*imd_score + ageatdiagnosis + Gender_Cleaned + Marital_Cleaned, data = subset(LSOAethnicdensity, LSOAsize > 19 & ethnicity == "White"))
Summarising results from Model 1
#Summarising full model
summary(linear.model.age.gender.demog)
##
## Call:
## lm(formula = trust.ed ~ ethnicdensityscore * LSOA_4boroughs *
## imd_score + ageatdiagnosis + Gender_Cleaned + Marital_Cleaned,
## data = subset(LSOAethnicdensity, LSOAsize > 19 & ethnicity ==
## "White"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.634 -4.918 0.072 4.906 22.952
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 63.701313 3.052941
## ethnicdensityscore 0.142422 0.048258
## LSOA_4boroughsCROYDON -29.557386 3.154573
## LSOA_4boroughsSOUTHWARK -30.950198 3.515002
## LSOA_4boroughsLEWSIHAM -26.030816 3.595451
## LSOA_4boroughsLAMBETH -23.909759 3.496290
## imd_score -0.479064 0.075525
## ageatdiagnosis 0.003108 0.003357
## Gender_CleanedFemale 0.059806 0.112299
## Marital_CleanedSingle -0.185420 0.210040
## Marital_CleanedMarried / Cohabiting 0.233335 0.239439
## Marital_CleanedDivorced / Separated / Widowed 0.063023 0.246566
## ethnicdensityscore:LSOA_4boroughsCROYDON 0.471593 0.049984
## ethnicdensityscore:LSOA_4boroughsSOUTHWARK 0.240070 0.058340
## ethnicdensityscore:LSOA_4boroughsLEWSIHAM 0.283859 0.063514
## ethnicdensityscore:LSOA_4boroughsLAMBETH 0.103247 0.060114
## ethnicdensityscore:imd_score 0.011365 0.001280
## LSOA_4boroughsCROYDON:imd_score 0.422401 0.080457
## LSOA_4boroughsSOUTHWARK:imd_score 0.356883 0.089949
## LSOA_4boroughsLEWSIHAM:imd_score 0.233026 0.094302
## LSOA_4boroughsLAMBETH:imd_score 0.146249 0.088120
## ethnicdensityscore:LSOA_4boroughsCROYDON:imd_score -0.007982 0.001366
## ethnicdensityscore:LSOA_4boroughsSOUTHWARK:imd_score -0.001371 0.001629
## ethnicdensityscore:LSOA_4boroughsLEWSIHAM:imd_score -0.003737 0.001823
## ethnicdensityscore:LSOA_4boroughsLAMBETH:imd_score -0.004761 0.001667
## t value Pr(>|t|)
## (Intercept) 20.866 < 2e-16 ***
## ethnicdensityscore 2.951 0.00317 **
## LSOA_4boroughsCROYDON -9.370 < 2e-16 ***
## LSOA_4boroughsSOUTHWARK -8.805 < 2e-16 ***
## LSOA_4boroughsLEWSIHAM -7.240 4.65e-13 ***
## LSOA_4boroughsLAMBETH -6.839 8.22e-12 ***
## imd_score -6.343 2.30e-10 ***
## ageatdiagnosis 0.926 0.35444
## Gender_CleanedFemale 0.533 0.59434
## Marital_CleanedSingle -0.883 0.37736
## Marital_CleanedMarried / Cohabiting 0.975 0.32982
## Marital_CleanedDivorced / Separated / Widowed 0.256 0.79826
## ethnicdensityscore:LSOA_4boroughsCROYDON 9.435 < 2e-16 ***
## ethnicdensityscore:LSOA_4boroughsSOUTHWARK 4.115 3.89e-05 ***
## ethnicdensityscore:LSOA_4boroughsLEWSIHAM 4.469 7.89e-06 ***
## ethnicdensityscore:LSOA_4boroughsLAMBETH 1.718 0.08590 .
## ethnicdensityscore:imd_score 8.876 < 2e-16 ***
## LSOA_4boroughsCROYDON:imd_score 5.250 1.54e-07 ***
## LSOA_4boroughsSOUTHWARK:imd_score 3.968 7.28e-05 ***
## LSOA_4boroughsLEWSIHAM:imd_score 2.471 0.01348 *
## LSOA_4boroughsLAMBETH:imd_score 1.660 0.09700 .
## ethnicdensityscore:LSOA_4boroughsCROYDON:imd_score -5.842 5.23e-09 ***
## ethnicdensityscore:LSOA_4boroughsSOUTHWARK:imd_score -0.842 0.40003
## ethnicdensityscore:LSOA_4boroughsLEWSIHAM:imd_score -2.050 0.04038 *
## ethnicdensityscore:LSOA_4boroughsLAMBETH:imd_score -2.857 0.00429 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.924 on 20256 degrees of freedom
## Multiple R-squared: 0.7104, Adjusted R-squared: 0.71
## F-statistic: 2070 on 24 and 20256 DF, p-value: < 2.2e-16
# Multiple R-squared: 0.7104, Adjusted R-squared: 0.71
# non significant terms are ageatdiagnosis, gender and marital status.
anova(linear.model.age.gender.demog)
## Analysis of Variance Table
##
## Response: trust.ed
## Df Sum Sq Mean Sq
## ethnicdensityscore 1 2453957 2453957
## LSOA_4boroughs 4 586077 146519
## imd_score 1 26360 26360
## ageatdiagnosis 1 470 470
## Gender_Cleaned 1 107 107
## Marital_Cleaned 3 674 225
## ethnicdensityscore:LSOA_4boroughs 4 18011 4503
## ethnicdensityscore:imd_score 1 16569 16569
## LSOA_4boroughs:imd_score 4 13068 3267
## ethnicdensityscore:LSOA_4boroughs:imd_score 4 4034 1009
## Residuals 20256 1271824 63
## F value Pr(>F)
## ethnicdensityscore 39083.5322 < 2.2e-16 ***
## LSOA_4boroughs 2333.5716 < 2.2e-16 ***
## imd_score 419.8263 < 2.2e-16 ***
## ageatdiagnosis 7.4909 0.006206 **
## Gender_Cleaned 1.7045 0.191712
## Marital_Cleaned 3.5762 0.013303 *
## ethnicdensityscore:LSOA_4boroughs 71.7147 < 2.2e-16 ***
## ethnicdensityscore:imd_score 263.8906 < 2.2e-16 ***
## LSOA_4boroughs:imd_score 52.0321 < 2.2e-16 ***
## ethnicdensityscore:LSOA_4boroughs:imd_score 16.0624 3.884e-13 ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
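Because Model 1 includes three-way interactions, the association between ethnicdensityscore and trust.ed cannot be read off a single coefficient. As a sketch of how the printed estimates combine (arithmetic on the coefficient table of Model 1 only; the deprivation value of 20 used afterwards is an arbitrary illustrative choice, not a value taken from the data), the slope of trust.ed on ethnicdensityscore in the reference borough (OTHER) and in CROYDON is:

$$
\begin{aligned}
\text{slope}_{\text{OTHER}}(\text{imd}) &= 0.142422 + 0.011365 \times \text{imd} \\
\text{slope}_{\text{CROYDON}}(\text{imd}) &= (0.142422 + 0.471593) + (0.011365 - 0.007982) \times \text{imd} \\
&= 0.614015 + 0.003383 \times \text{imd}
\end{aligned}
$$

At an imd_score of 20, for example, a one-point increase in population ethnic density is associated with an increase in trust ethnic density of about 0.37 points in the OTHER borough and about 0.68 points in CROYDON.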
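The terms that are non-significant in the coefficient table (age at diagnosis, gender and marital status) are removed from the model one at a time, and each reduced model is assessed using R^2.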
# Model 2
linear.model.age.gender.demog.rm.age <- update(linear.model.age.gender.demog, ~. -ageatdiagnosis)
summary(linear.model.age.gender.demog.rm.age)
#Multiple R-squared: 0.7104, Adjusted R-squared: 0.71
# Model 3
linear.model.age.gender.demog.rm.age.gender <- update(linear.model.age.gender.demog.rm.age, ~. -Gender_Cleaned)
summary(linear.model.age.gender.demog.rm.age.gender)
#Multiple R-squared: 0.7104, Adjusted R-squared: 0.71
# Model 4
linear.model.age.gender.demog.rm.age.gender.marital <- update(linear.model.age.gender.demog.rm.age.gender, ~. -Marital_Cleaned)
summary(linear.model.age.gender.demog.rm.age.gender.marital)
#Multiple R-squared: 0.7102, Adjusted R-squared: 0.7099
All models perform roughly the same. Model 1 and Model 4 will be selected to check diagnostics.
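The R-squared comparison can be complemented with a formal F-test between nested models. The following is a minimal sketch on simulated data, not the study data: the data frame `sim` and its columns `y`, `x` and `age` are hypothetical names, and `age` is deliberately generated to be unrelated to `y`.

```r
# Simulated stand-in data (hypothetical; not the study data)
set.seed(42)
n <- 500
sim <- data.frame(
  x   = runif(n, 0, 100),   # stands in for a strong predictor
  age = rnorm(n, 40, 12)    # stands in for a covariate unrelated to the outcome
)
sim$y <- 10 + 0.5 * sim$x + rnorm(n, sd = 8)

# Full model, and a reduced model with the covariate dropped via update()
full    <- lm(y ~ x + age, data = sim)
reduced <- update(full, ~ . - age)

# Adjusted R-squared of each model
adj.r2 <- c(full    = summary(full)$adj.r.squared,
            reduced = summary(reduced)$adj.r.squared)
print(adj.r2)

# F-test for the nested models: a large p-value suggests the dropped term adds little
model.comparison <- anova(reduced, full)
print(model.comparison)
```

If the adjusted R-squared barely moves and the F-test is not significant, the simpler model is usually preferred.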
Diagnostic plots
Model 1
plot(linear.model.age.gender.demog, which = c(1,2))
Model 4
plot(linear.model.age.gender.demog.rm.age.gender.marital, which = c(1,2))
Both models perform equally well, so the simpler Model 4 is selected as the final model to predict trust ethnic density.
Output from Model 4
##
## Call:
## lm(formula = trust.ed ~ ethnicdensityscore + LSOA_4boroughs +
## imd_score + ethnicdensityscore:LSOA_4boroughs + ethnicdensityscore:imd_score +
## LSOA_4boroughs:imd_score + ethnicdensityscore:LSOA_4boroughs:imd_score,
## data = subset(LSOAethnicdensity, LSOAsize > 19 & ethnicity ==
## "White"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.483 -4.879 0.137 4.965 22.714
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 63.741631 3.045672
## ethnicdensityscore 0.143998 0.048260
## LSOA_4boroughsCROYDON -29.462539 3.154551
## LSOA_4boroughsSOUTHWARK -30.963113 3.515554
## LSOA_4boroughsLEWSIHAM -25.964733 3.595507
## LSOA_4boroughsLAMBETH -23.829325 3.496284
## imd_score -0.478910 0.075531
## ethnicdensityscore:LSOA_4boroughsCROYDON 0.471147 0.049988
## ethnicdensityscore:LSOA_4boroughsSOUTHWARK 0.239621 0.058350
## ethnicdensityscore:LSOA_4boroughsLEWSIHAM 0.282985 0.063519
## ethnicdensityscore:LSOA_4boroughsLAMBETH 0.101730 0.060116
## ethnicdensityscore:imd_score 0.011334 0.001280
## LSOA_4boroughsCROYDON:imd_score 0.420475 0.080457
## LSOA_4boroughsSOUTHWARK:imd_score 0.357999 0.089964
## LSOA_4boroughsLEWSIHAM:imd_score 0.231735 0.094310
## LSOA_4boroughsLAMBETH:imd_score 0.145644 0.088125
## ethnicdensityscore:LSOA_4boroughsCROYDON:imd_score -0.007953 0.001366
## ethnicdensityscore:LSOA_4boroughsSOUTHWARK:imd_score -0.001345 0.001629
## ethnicdensityscore:LSOA_4boroughsLEWSIHAM:imd_score -0.003688 0.001823
## ethnicdensityscore:LSOA_4boroughsLAMBETH:imd_score -0.004738 0.001667
## t value Pr(>|t|)
## (Intercept) 20.929 < 2e-16 ***
## ethnicdensityscore 2.984 0.00285 **
## LSOA_4boroughsCROYDON -9.340 < 2e-16 ***
## LSOA_4boroughsSOUTHWARK -8.807 < 2e-16 ***
## LSOA_4boroughsLEWSIHAM -7.221 5.33e-13 ***
## LSOA_4boroughsLAMBETH -6.816 9.65e-12 ***
## imd_score -6.341 2.34e-10 ***
## ethnicdensityscore:LSOA_4boroughsCROYDON 9.425 < 2e-16 ***
## ethnicdensityscore:LSOA_4boroughsSOUTHWARK 4.107 4.03e-05 ***
## ethnicdensityscore:LSOA_4boroughsLEWSIHAM 4.455 8.43e-06 ***
## ethnicdensityscore:LSOA_4boroughsLAMBETH 1.692 0.09062 .
## ethnicdensityscore:imd_score 8.851 < 2e-16 ***
## LSOA_4boroughsCROYDON:imd_score 5.226 1.75e-07 ***
## LSOA_4boroughsSOUTHWARK:imd_score 3.979 6.93e-05 ***
## LSOA_4boroughsLEWSIHAM:imd_score 2.457 0.01401 *
## LSOA_4boroughsLAMBETH:imd_score 1.653 0.09841 .
## ethnicdensityscore:LSOA_4boroughsCROYDON:imd_score -5.820 5.96e-09 ***
## ethnicdensityscore:LSOA_4boroughsSOUTHWARK:imd_score -0.825 0.40911
## ethnicdensityscore:LSOA_4boroughsLEWSIHAM:imd_score -2.023 0.04310 *
## ethnicdensityscore:LSOA_4boroughsLAMBETH:imd_score -2.843 0.00448 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.925 on 20261 degrees of freedom
## Multiple R-squared: 0.7102, Adjusted R-squared: 0.7099
## F-statistic: 2613 on 19 and 20261 DF, p-value: < 2.2e-16
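From the summary, the coefficient for ethnic density score is 0.144. Because this is a linear model, the effect is additive: in the reference borough (OTHER) at an imd_score of zero, each one-unit increase in ethnic density score is associated with a 0.144-unit increase in trust ethnic density. The interaction terms change this slope by borough and by deprivation score. This reflects the results from the EDA plots in the White ethnic group. The trust ethnic density is strongly positively correlated with population ethnic density, which also makes intuitive sense.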
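Because Model 4 includes interactions, the slope of trust ethnic density on ethnic density score is not a single number: it depends on the borough and on imd_score. A small sketch of that arithmetic, using the estimates printed in the Model 4 output, is given here. The helper name `eds.slope` is hypothetical and is introduced only for illustration.

```r
# Coefficients copied from the Model 4 summary (reference borough: OTHER)
b.eds     <- 0.143998   # ethnicdensityscore
b.eds.imd <- 0.011334   # ethnicdensityscore:imd_score
b.eds.borough <- c(OTHER = 0, CROYDON = 0.471147, SOUTHWARK = 0.239621,
                   LEWSIHAM = 0.282985, LAMBETH = 0.101730)
b.eds.borough.imd <- c(OTHER = 0, CROYDON = -0.007953, SOUTHWARK = -0.001345,
                       LEWSIHAM = -0.003688, LAMBETH = -0.004738)

# Slope of trust.ed with respect to ethnicdensityscore
# for a given borough and imd_score (hypothetical helper)
eds.slope <- function(borough, imd) {
  unname(b.eds + b.eds.borough[borough] +
           (b.eds.imd + b.eds.borough.imd[borough]) * imd)
}

eds.slope("OTHER", 0)      # 0.143998
eds.slope("CROYDON", 20)   # 0.682765
```

For example, in Croydon at an imd_score of 20 the slope is 0.143998 + 0.471147 + (0.011334 - 0.007953) * 20 = 0.682765.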
Ratio and Ethnic Density
Can the relationship seen between ethnic density and ratio be shown statistically?
Model A
# summary(ModelA) # Multiple R-squared: 0.6626, Adjusted R-squared: 0.6622
Removing age as it is not significant (see Model B).
ModelB <- update(ModelA, ~. -ageatdiagnosis)
# summary(ModelB) #Multiple R-squared: 0.6626, Adjusted R-squared: 0.6622
Removing gender as it is not significant (see Model C).
ModelC <- update(ModelB, ~. -Gender_Cleaned)
# summary(ModelC) # Multiple R-squared: 0.6625, Adjusted R-squared: 0.6622
Removing marital status as it is not significant (see Model D).
ModelD <- update(ModelC, ~. -Marital_Cleaned)
# summary(ModelD) # Multiple R-squared: 0.6623, Adjusted R-squared: 0.662
Judging by R-squared alone, the performance of these models is similar. All of them suggest a negative association between ethnic density score and ratio (as expected from the EDA). With every unit increase in ethnic density score, the ratio decreases by 0.02. The association is very weak (OR 0.988) but it is significant.
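The construction of the ratio is not shown in this section. As a sketch only, assuming the ratio is trust ethnic density divided by the population ethnic density score, it could be derived as follows. The data frame `example` and its values are hypothetical and are not taken from the study data.

```r
# Hypothetical values for three LSOAs (both columns are percentages)
example <- data.frame(
  ethnicdensityscore = c(10, 40, 80),   # population ethnic density
  trust.ed           = c(25, 44, 72)    # trust ethnic density
)

# Assumed definition: a ratio above 1 means the group is over-represented
# in services relative to its share of the local population
example$ratio <- example$trust.ed / example$ethnicdensityscore
print(example)
```

Under this assumed definition, a ratio that falls as ethnic density score rises corresponds to the negative association reported above.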
Conclusion from investigating the association of ratio and trust ethnic density with population ethnic density
Ethnic density is a strong predictor of trust ethnic density. It can also predict the ratio of trust ethnic density to the respective population ethnic density score.
The project initially set out to investigate the relationship between suicide and population ethnic density. To that end, the association of suicide with ethnic density was explored in the exploratory data analysis and in the final analysis. We concluded that ethnic density is not a predictor of death by suicide in this clinical cohort. This contradicts the effect suggested for ethnic density on suicide-related behaviour in community settings, where there are indications of a protective effect. Our results have several possible limitations. For example, we assume that each individual was still exposed to the same ethnic density score at the time of suicide.
During exploratory data analysis, the relationship between trust ethnic density and population ethnic density was uncovered. It turned out that an increase in ethnic density in the population was not reflected proportionately in mental health services. In fact, individuals living in areas with very few residents of their own ethnic group were the most likely to be known to mental health services. This increase in the odds of being known to services as population ethnic density decreases was replicated across all ethnic groups. The results suggest that there could be an ethnic density effect, and that the weaker this effect, the higher the chances of experiencing mental health issues. Further work is required to explore this fully.
Outcome
| Variable Name | Definition/Categories |
|---|---|
| Suicide | binary variable for patients who died by suicide; 0 = did not die by suicide, 1 = died by suicide |
Exposure: Ethnic Density Score
| Variable Name | Definition/Categories |
|---|---|
| ethnicdensityscore | Defined as the percentage composition of each ethnic group residing in a geographical area of a given size. |
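As an illustration of this definition, the percentage composition for "Area A" from the worked example at the start of this document can be computed as follows. The object names `area.a` and `ed.percent` are hypothetical.

```r
# Resident counts for "Area A" from the worked example
area.a <- c(Indian = 500, British = 250, African = 750)
total.residents <- sum(area.a)   # 1500

# Percentage composition of each ethnic group in the area
ed.percent <- 100 * area.a / total.residents
round(ed.percent, 1)

# Own-group ethnic density score for a resident of African descent
ed.percent[["African"]]
```

A resident's ethnic density score is the entry for their own ethnic group, so a resident of African descent in Area A has a score of 50%.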
Demographics
| Variable Name | Definition/Categories |
|---|---|
| Gender_Cleaned | Female, Male, Unknown |
| Marital_Cleaned | Divorced / Separated / Widowed ; Married / Cohabiting ; Single ; Undisclosed |
| DOB_Cleaned | Patient date of birth (dob) |
| ethnicitycleaned | Patient ethnicity |
| ethnicity | Aggregated ethnic groups: White, Other White, Black, Asian and Mixed |
| imd_score | patient’s area level deprivation score. The higher the score, the more deprived the area |
| imd_quartiles | imd_score grouped into quartiles [ref] |
| ageatdeath | age at death (any cause of death) |
| ageatdiagnosis | age at primary diagnosis |
| agegroups | age groups according to age at diagnosis |
| LSOA_4boroughs | Boroughs; CROYDON; LAMBETH; LEWSIHAM; OTHER; SOUTHWARK |
Death related variables
| Variable Name | Definition/Categories |
|---|---|
| dateofdeath | date of death for patients who died |
| DeathBy | cause of death |
Diagnosis related variables
| Variable Name | Definition/Categories |
|---|---|
| primary_diagnosis | first diagnosis closest to the start of the observation window |
| diagnosisdate | date of primary diagnosis |
| Schizophrenia_Diag | binary variable to indicate if the patient has had a diagnosis of Schizophrenia disorder at some point during the observation window |
| SchizoAffective_Diag | binary variable to indicate if the patient has had a diagnosis of Schizoaffective disorder at some point during the observation window |
| Depressive_Diag | binary variable to indicate if the patient has had a diagnosis of Depressive disorder (mild to severe) at some point during the observation window |
| SubAbuse_Diag | binary variable to indicate if the patient has had a diagnosis of Substance Abuse disorder at some point during the observation window |
| Manic_Diag | binary variable to indicate if the patient has had a diagnosis of Manic disorder at some point during the observation window |
| Bipolar_Diag | binary variable to indicate if the patient has had a diagnosis of Bipolar disorder at some point during the observation window |
Overall Ethnic Density in each known LSOA variable
| Variable Name | Definition/Categories |
|---|---|
| LSOA11 | Each patient’s area-level address code. This geographical code covers an area of ~1500 residents |
| TotalResidentsInLSOA | The actual number of residents in the corresponding LSOA code |
| WhiteBrit_EDPercent | The percentage ethnic density, or ethnic composition, of White British ethnic group in the corresponding LSOA code |
| WhiteIrish_EDPercent | The percentage ethnic density, or ethnic composition, of White Irish ethnic group in the corresponding LSOA code |
| OtherWhite_EDPercent | The percentage ethnic density, or ethnic composition, of any other White ethnic group in the corresponding LSOA code |
| WhiteBlackCarib_EDPercent | The percentage ethnic density, or ethnic composition, of Mixed White and Black Caribbean ethnic group in the corresponding LSOA code |
| WhiteBlackAfri_EDPercent | The percentage ethnic density, or ethnic composition, of Mixed White and Black African ethnic group in the corresponding LSOA code |
| WhiteAsian_EDPercent | The percentage ethnic density, or ethnic composition, of Mixed White and Asian ethnic group in the corresponding LSOA code |
| OtherMixed_EDPercent | The percentage ethnic density, or ethnic composition, of any other Mixed Race ethnic group in the corresponding LSOA code |
| BritIndian_EDPercent | The percentage ethnic density, or ethnic composition, of the Indian ethnic group in the corresponding LSOA code |
| BritPakistani_EDPercent | The percentage ethnic density, or ethnic composition, of the Pakistani ethnic group in the corresponding LSOA code |
| BritBangladeshi_EDPercent | The percentage ethnic density, or ethnic composition, of the Bangladeshi ethnic group in the corresponding LSOA code |
| BritChinese_EDPercent | The percentage ethnic density, or ethnic composition, of the Chinese ethnic group in the corresponding LSOA code |
| OtherAsian_EDPercent | The percentage ethnic density, or ethnic composition, of any other Asian ethnic groups in the corresponding LSOA code |
| African_EDPercent | The percentage ethnic density, or ethnic composition, of Black British or African ethnic group in the corresponding LSOA code |
| Caribbean_EDPercent | The percentage ethnic density, or ethnic composition, of the Caribbean ethnic group in the corresponding LSOA code |
| OtherBlack_EDPercent | The percentage ethnic density, or ethnic composition, of any other Black ethnic group in the corresponding LSOA code |
sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.5 (El Capitan)
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] parallel grid stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] tree_1.0-37 pROC_1.8 doMC_1.3.4 iterators_1.0.8
## [5] foreach_1.4.3 DMwR_0.4.1 MASS_7.3-45 forestplot_1.4
## [9] magrittr_1.5 caret_6.0-68 lattice_0.20-33 Amelia_1.7.4
## [13] Rcpp_0.12.5 knitr_1.13 gridExtra_2.2.1 GGally_1.0.1
## [17] gmodels_2.16.2 ggplot2_2.1.0 dplyr_0.4.3 tidyr_0.4.1
## [21] lubridate_1.5.6 foreign_0.8-66
##
## loaded via a namespace (and not attached):
## [1] class_7.3-14 zoo_1.7-13 gtools_3.5.0
## [4] assertthat_0.1 digest_0.6.9 R6_2.1.2
## [7] plyr_1.8.3.9000 MatrixModels_0.4-1 stats4_3.3.0
## [10] e1071_1.6-7 evaluate_0.9 highr_0.6
## [13] gplots_3.0.1 lazyeval_0.1.10 minqa_1.2.4
## [16] gdata_2.17.0 SparseM_1.7 car_2.1-2
## [19] TTR_0.23-1 nloptr_1.0.4 rpart_4.1-10
## [22] Matrix_1.2-6 rmarkdown_0.9.6 labeling_0.3
## [25] splines_3.3.0 lme4_1.1-12 stringr_1.0.0
## [28] munsell_0.4.3 compiler_3.3.0 mgcv_1.8-12
## [31] htmltools_0.3.5 nnet_7.3-12 codetools_0.2-14
## [34] reshape_0.8.5 bitops_1.0-6 nlme_3.1-128
## [37] gtable_0.2.0 DBI_0.4-1 formatR_1.4
## [40] scales_0.4.0 KernSmooth_2.23-15 quantmod_0.4-5
## [43] stringi_1.0-1 ROCR_1.0-7 reshape2_1.4.1
## [46] xts_0.9-7 tools_3.3.0 abind_1.4-3
## [49] pbkrtest_0.4-6 yaml_2.1.13 colorspace_1.2-6
## [52] caTools_1.17.1 quantreg_5.24